1
00:00:00,012 --> 00:00:04,937
In this video, we're going to talk about 
the Speech Signal and try to understand 

2
00:00:04,937 --> 00:00:08,477
its structure. 
That structure is exploited in all kinds 

3
00:00:08,477 --> 00:00:13,372
of systems, including modern telephone 
systems and cell phone communication 

4
00:00:13,372 --> 00:00:16,354
systems. 
It's really important to understand the

5
00:00:16,354 --> 00:00:18,847
structure. 
It's one thing to have a general 

6
00:00:18,847 --> 00:00:23,519
characterization like, what bandwidth, 
what kind of frequencies does the speech 

7
00:00:23,519 --> 00:00:26,730
signal occupy. 
But there's more structure beyond that, 

8
00:00:26,730 --> 00:00:30,886
that's going to be very important in 
designing efficient communications 

9
00:00:30,886 --> 00:00:35,592
systems for speech.
So all the aspects of the speech

10
00:00:35,592 --> 00:00:39,132
signal are determined by how it's
generated. 

11
00:00:39,132 --> 00:00:44,862
And we're going to develop a model for 
that, one based on linear signals and

12
00:00:44,862 --> 00:00:47,417
systems. 
Things we already know. 

13
00:00:47,417 --> 00:00:52,872
And, this is going to lead us to look at 
the spectral structure of speech. 

14
00:00:52,872 --> 00:00:58,372
Turns out in the time domain you can only 
gain so much information from examining a 

15
00:00:58,372 --> 00:01:02,147
speech signal. 
But if you look at its spectrum, you look 

16
00:01:02,147 --> 00:01:06,847
at its Fourier transform, then all of a
sudden a lot of information pops out 

17
00:01:06,847 --> 00:01:12,372
that's very, very useful in understanding 
what is going on and what is being said. 

18
00:01:12,372 --> 00:01:16,912
So, how is speech generated? So here's my 
rather crude drawing

19
00:01:16,912 --> 00:01:25,232
of the speech production system.
Everything begins with the lungs, and the 

20
00:01:25,232 --> 00:01:33,372
lungs are providing a source of air 
pressure, and the important part for us

21
00:01:33,372 --> 00:01:37,797
in understanding speech is the vocal
cords.

22
00:01:37,797 --> 00:01:44,602
Now, seen from the top, the vocal
cords, the structure looks

23
00:01:44,602 --> 00:01:50,647
something like this. 
Where these are tendon-like things.

24
00:01:50,647 --> 00:01:54,842
And in between, there's a slit that's
open. 

25
00:01:54,842 --> 00:02:02,297
And when you breathe, the vocal cords,
this part, are not under any tension,

26
00:02:02,297 --> 00:02:07,977
they're loose, the slit opens up, and you 
breathe normally.

27
00:02:07,977 --> 00:02:16,299
When you want to say something, what 
happens is that these come under

28
00:02:16,299 --> 00:02:20,563
tension. 
They are pulled. 

29
00:02:20,563 --> 00:02:28,981
And that closes up this slit. 
Alright? And in fact they become 

30
00:02:28,981 --> 00:02:36,115
completely closed and it's not until the 
air pressure from the lungs builds up

31
00:02:36,115 --> 00:02:43,062
enough that it forces this slit to open
momentarily, releasing a puff of air, 

32
00:02:43,062 --> 00:02:48,822
which goes into the vocal tract. 
The vocal cords close.

33
00:02:48,822 --> 00:02:54,197
Air pressure builds up again, it's released
rather quickly giving another puff of 

34
00:02:54,197 --> 00:02:57,622
air, and that's what I've tried to 
indicate there. 

35
00:02:57,622 --> 00:03:02,647
So if you were to plot as a function of 
time, the air pressure just above the 

36
00:03:02,647 --> 00:03:07,747
vocal cords, you would get nothing for a 
while because the pressure below is 

37
00:03:07,747 --> 00:03:12,847
building up and finally it releases, 
releasing that puff of air and it comes 

38
00:03:12,847 --> 00:03:17,560
back down. 
And then the pressure builds up from the 

39
00:03:17,560 --> 00:03:22,100
lungs again, and it pops up again, and 
releases. 

40
00:03:22,100 --> 00:03:29,462
And roughly, it produces a periodic 
pulse-like sequence from the vocal cords. 

41
00:03:29,462 --> 00:03:34,559
And this period is known as the pitch 
period. 

42
00:03:34,559 --> 00:03:40,623
And we're going to talk more about that 
in just a second. 

43
00:03:40,623 --> 00:03:46,332
Well, now what happens is that these 
puffs of air

44
00:03:46,332 --> 00:03:53,528
go into what's called the vocal
tract, which is formed by all of these 

45
00:03:53,528 --> 00:04:00,971
structures: the tongue, the lips, the
opening of your mouth, all kinds of 

46
00:04:00,971 --> 00:04:05,403
things. 
And if I were to draw this again, so we 

47
00:04:05,403 --> 00:04:09,599
might have something that looks like 
this. 

48
00:04:09,599 --> 00:04:15,167
It opens up,
kind of has some teeth here at the end.

49
00:04:15,167 --> 00:04:21,714
And what's happening, those puffs of air 
are coming in there. 

50
00:04:21,714 --> 00:04:28,189
Acoustically, what your mouth looks
like is a pipe.

51
00:04:28,189 --> 00:04:36,712
If I were to straighten this
out, it would look like a pipe that has a

52
00:04:36,712 --> 00:04:44,862
cross-sectional area that's varying.
So, this is the length of the pipe and 

53
00:04:44,862 --> 00:04:52,582
this is its cross section.
And acoustically, your mouth looks like

54
00:04:52,582 --> 00:04:57,888
a straight pipe which has a complicated
structure. 

55
00:04:57,888 --> 00:05:05,207
Well, what happens now is the air 
pressure signal comes in here and, the

56
00:05:05,207 --> 00:05:10,970
word is, it excites the
air pressure inside the vocal tract.

57
00:05:10,970 --> 00:05:16,712
Well, it turns out we can describe
this as a linear filter, having peaks, 

58
00:05:16,712 --> 00:05:22,597
that are called resonances, and this
functions in many ways like an organ

59
00:05:22,597 --> 00:05:26,089
pipe. 
An organ pipe is a straight tube; it has a very

60
00:05:26,089 --> 00:05:31,592
simple resonant structure. 
The way a note sounds on an organ:

61
00:05:31,592 --> 00:05:36,692
as you push the key, what happens? Air is
forced into one of those pipes and it 

62
00:05:36,692 --> 00:05:42,492
gives you this nice, not quite pure tone 
but a resonant structure having harmonics

63
00:05:42,492 --> 00:05:46,642
of some fundamental. 
Because of the variable cross-sectional

64
00:05:46,642 --> 00:05:51,642
area here, it turns out the resonance
structure becomes more complicated.

65
00:05:51,642 --> 00:05:55,217
And that is what we are going to see in 
the next slide. 

66
00:05:55,217 --> 00:06:00,747
So let's summarize what the speech model 
is, at least in terms of a system

67
00:06:00,747 --> 00:06:04,132
theoretic model. 
So, we have the lungs. 

68
00:06:04,132 --> 00:06:09,442
And what they're producing is actually a 
pretty boring signal. 

69
00:06:09,442 --> 00:06:15,182
It's basically a constant. 
The vocal cords will either be open while 

70
00:06:15,182 --> 00:06:19,966
you're breathing, or
they will be put under tension when

71
00:06:19,966 --> 00:06:25,101
you're trying to speak and that means 
there's some neural control coming from 

72
00:06:25,101 --> 00:06:30,237
the brain that's controlling that. 
That, from the point of view of speech,

73
00:06:30,237 --> 00:06:35,427
is the only thing we care about; that
produces what we're going to describe as

74
00:06:35,427 --> 00:06:39,117
a periodic pulse train, a periodic pulse 
sequence. 

75
00:06:39,117 --> 00:06:44,507
Something we've already talked about. 
That serves as the input to the vocal 

76
00:06:44,507 --> 00:06:48,087
tract. According to what you want to
say,

77
00:06:48,087 --> 00:06:53,687
the position of the tongue, the lips,
everything else is again under neural

78
00:06:53,687 --> 00:06:57,362
control and that determines what is being 
said. 

79
00:06:57,362 --> 00:07:02,783
And the result is the speech signal that 
you pick up with a microphone. 

80
00:07:02,783 --> 00:07:09,422
Well let's examine this in more detail to 
figure out important characteristics of 

81
00:07:09,422 --> 00:07:13,525
these signals. 
So here's that model again and I want to 

82
00:07:13,525 --> 00:07:19,048
start with the
input here, the periodic pitch signal,

83
00:07:19,048 --> 00:07:25,567
which is a periodic pulse train; we've
already talked about this many times. 

84
00:07:25,567 --> 00:07:32,040
So it consists of a set of pulses,
quite narrow, and T is what's known as the

85
00:07:32,040 --> 00:07:36,635
pitch period. 
But what people normally refer to as the 

86
00:07:36,635 --> 00:07:44,091
pitch is the pitch frequency which is 1/T 
and in the speech world that's called F0, 

87
00:07:44,091 --> 00:07:50,244
and that's the pitch. 
So as you change the tension of your 

88
00:07:50,244 --> 00:07:57,489
vocal cords, you can make your pitch go up,
or go down if you loosen them.

89
00:07:57,489 --> 00:08:02,463
And that's all again under
neural control.

90
00:08:02,463 --> 00:08:09,150
Now, that signal serves as the input
into what we're going to describe as a 

91
00:08:09,150 --> 00:08:14,053
linear filter,
which has a transfer function that will

92
00:08:14,053 --> 00:08:20,343
change according to the vowel or the 
speech sound you're trying to make. 

93
00:08:20,343 --> 00:08:25,717
So
I have here two sounds, vowel sounds that

94
00:08:25,717 --> 00:08:32,912
I chose, the O and the E, and
what I'm plotting here is the transfer

95
00:08:32,912 --> 00:08:39,410
function of the vocal tract.
So an O sound has a series of peaks, 

96
00:08:39,410 --> 00:08:43,882
these are the resonances I was referring 
to. 

97
00:08:43,882 --> 00:08:49,587
And so does the E sound. 
It has a series of peaks, but you can 

98
00:08:49,587 --> 00:08:56,392
see, they have a very different 
structure, and, let's look at that 

99
00:08:56,392 --> 00:09:02,792
structure in more detail. 
In the speech world, each of these peaks 

100
00:09:02,792 --> 00:09:08,942
is called a formant, and the formants are 
determined by their

101
00:09:08,942 --> 00:09:12,427
frequencies, which are just numbered
sequentially. 

102
00:09:12,427 --> 00:09:17,695
So the frequency of the
peak at the lowest frequency is

103
00:09:17,695 --> 00:09:22,673
called F1, the next higher up is
called F2, the next one's F3, then

104
00:09:22,673 --> 00:09:25,501
F4.
So they're all just numbered sequentially

105
00:09:25,501 --> 00:09:31,196
according to the order
they appear, from low to high, in the

106
00:09:31,196 --> 00:09:37,300
speech spectrum. 
So the O has a moderately high F1, a

107
00:09:37,300 --> 00:09:42,759
very low F2, F3 is shifted up making a
nice valley. 

108
00:09:42,759 --> 00:09:49,732
And then there's F4 and F5.
For the E sound, F1 is lower than it is

109
00:09:49,732 --> 00:09:53,502
for the O.
F2 is all the way up here, so that 

110
00:09:53,502 --> 00:09:57,212
resonance structure has really changed a 
lot. 

111
00:09:57,212 --> 00:10:03,292
F3 is a little higher than it was for the 
O and then F4 and F5 are roughly at the 

112
00:10:03,292 --> 00:10:04,722
same place. 
So

113
00:10:04,722 --> 00:10:12,291
you can actually look at the spectra and
figure out what the vowel is that's being 

114
00:10:12,291 --> 00:10:16,097
said. 
Now if you look in the time domain, 

115
00:10:16,097 --> 00:10:23,131
things are not quite so clear. 
Here is a segment of speech corresponding

116
00:10:23,131 --> 00:10:29,368
to me saying O, and a segment of
speech corresponding to me saying E.

117
00:10:29,368 --> 00:10:35,932
Now what you should recall is back when 
we talked about sending this periodic 

118
00:10:35,932 --> 00:10:39,770
pulse sequence through our RC lowpass
filter. 

119
00:10:39,770 --> 00:10:46,222
What we got out was again something that 
was periodic, kind of looked like this. 

120
00:10:46,222 --> 00:10:50,287
And so we should expect something similar 
here. 

121
00:10:50,287 --> 00:10:55,291
We have the same pulse sequence going 
into a linear filter. 

122
00:10:55,291 --> 00:11:02,198
It's a bit more complicated than a simple 
RC lowpass, but again we should see a

123
00:11:02,198 --> 00:11:09,330
periodic output. I used this example at
the beginning of the course when we

124
00:11:09,330 --> 00:11:17,102
talked about signals, and
the period should be readily

125
00:11:17,102 --> 00:11:21,316
evident. 
I'd like you to tell me what the period 

126
00:11:21,316 --> 00:11:26,692
is for the vowel
E in this case, okay? I get a period of

127
00:11:26,692 --> 00:11:31,930
roughly 10 milliseconds; this scale is in
seconds. 

128
00:11:31,930 --> 00:11:40,450
The separation between these peaks is about
10 milliseconds, which makes the pitch 

129
00:11:40,450 --> 00:11:41,806
100
hertz.
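
The arithmetic here, pitch frequency F0 = 1/T, can be sketched in a few lines of Python. This is only an illustration: the peak times below are assumed values read by eye off a plot, and the function names are hypothetical, not part of the lecture.

```python
def period_from_peaks(peak_times_s):
    """Average spacing between successive waveform peaks, in seconds."""
    if len(peak_times_s) < 2:
        raise ValueError("need at least two peaks to measure a period")
    gaps = [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]
    return sum(gaps) / len(gaps)


def pitch_frequency_hz(pitch_period_s):
    """Pitch frequency F0 = 1/T for a pitch period T given in seconds."""
    if pitch_period_s <= 0:
        raise ValueError("pitch period must be positive")
    return 1.0 / pitch_period_s


# Assumed peak locations (seconds), spaced about 10 milliseconds apart.
peaks = [0.005, 0.015, 0.025]
T = period_from_peaks(peaks)
F0 = pitch_frequency_hz(T)
print(f"pitch period = {T * 1000:.1f} ms, pitch frequency = {F0:.0f} Hz")
```

A longer pitch period gives a lower pitch frequency, which is the comparison asked for next.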

130
00:11:41,806 --> 00:11:49,089
Now, another question for you.
Let me clear the slide so we can see

131
00:11:49,089 --> 00:11:52,852
things. 
What is the pitch here, and in

132
00:11:52,852 --> 00:12:00,487
particular, is it higher or lower than
it is for the E? And when I talk about

133
00:12:00,487 --> 00:12:08,644
pitch, we usually worry about
the frequency: not the interval, but the

134
00:12:08,644 --> 00:12:14,111
frequency. 
Is the pitch frequency lower or higher

135
00:12:14,111 --> 00:12:22,828
for the O? Okay? I think it's pretty
evident that the pitch frequency is

136
00:12:22,828 --> 00:12:27,677
lower. 
Because there are fewer; there's only 1

137
00:12:27,677 --> 00:12:34,790
period there, and I see at least 2 there
in a shorter period of time.

138
00:12:34,790 --> 00:12:39,867
So the pitch is a little bit
below 100 hertz.

139
00:12:39,867 --> 00:12:47,412
So, what should we expect the spectrum
of the speech to look like? Well, as we 

140
00:12:47,412 --> 00:12:56,043
know, the spectrum is equal to the
transfer function times the Fourier

141
00:12:56,043 --> 00:13:02,533
transform,
in this case the Fourier series, of the

142
00:13:02,533 --> 00:13:08,172
pitch signal. 
So, you should expect the transfer

143
00:13:08,172 --> 00:13:15,012
function, H(f), which is one of these,
to be multiplied by the Fourier series

144
00:13:15,012 --> 00:13:18,992
for this, which consists of a set of 
harmonics. 

145
00:13:18,992 --> 00:13:22,757
They're all harmonics of
F0.
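
As a rough sketch of that product, the Python below multiplies the Fourier series magnitudes of a rectangular periodic pulse train by a transfer function evaluated at each harmonic of F0. The filter used here is a toy single-resonance response chosen only for illustration; it is an assumption, not a measured vocal tract, and the pulse width, resonance frequency and function names are likewise hypothetical.

```python
import math


def pulse_train_coeff_mag(k, period_s, width_s, amplitude=1.0):
    """|c_k| for a rectangular pulse train: A * (width/T) * |sinc(k * width/T)|."""
    duty = width_s / period_s
    if k == 0:
        return amplitude * duty
    x = math.pi * k * duty
    return amplitude * duty * abs(math.sin(x) / x)


def toy_resonance_mag(f_hz, f_res_hz=500.0, q=5.0):
    """Magnitude of a second-order resonance (toy stand-in for one formant)."""
    r = f_hz / f_res_hz
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)


def speech_line_spectrum(f0_hz, width_s, n_harmonics):
    """(frequency, magnitude) pairs at the harmonics k*F0: |H(k F0)| * |c_k|."""
    period_s = 1.0 / f0_hz
    return [
        (k * f0_hz,
         toy_resonance_mag(k * f0_hz) * pulse_train_coeff_mag(k, period_s, width_s))
        for k in range(1, n_harmonics + 1)
    ]


# Assumed values: 100 Hz pitch, 1 ms wide pulses, first ten harmonics.
lines = speech_line_spectrum(100.0, 0.001, 10)
for freq, mag in lines:
    print(f"{freq:6.0f} Hz  {mag:.4f}")
```

The lines sit only at multiples of F0, and their heights trace out the filter's peak scaled by the slowly decaying sinc envelope of the pulse train.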

146
00:13:22,757 --> 00:13:28,717
So our model for speech is a little
bit simpler than what I've shown.

147
00:13:28,717 --> 00:13:34,392
We just worry about
what the pitch signal is, basically

148
00:13:34,392 --> 00:13:38,657
what its period is, what its pitch
frequency is. 

149
00:13:38,657 --> 00:13:44,902
And then we worry about what the
transfer function is, and that is how we

150
00:13:44,902 --> 00:13:52,707
describe the speech signal.
And let me show you what an actual speech 

151
00:13:52,707 --> 00:14:02,402
spectrum looks like and now it makes a 
little bit more sense what we're seeing. 

152
00:14:02,402 --> 00:14:11,605
So you can see these peaks. The
spectrum of our periodic pulse train is

153
00:14:11,605 --> 00:14:17,960
going to be a set of lines, decaying like 
a sinc function,

154
00:14:17,960 --> 00:14:23,452
at the harmonics of F0.
So that would be at F0,

155
00:14:23,452 --> 00:14:28,304
2F0, 3F0, etcetera. 
And that corresponds to each of these

156
00:14:28,304 --> 00:14:31,242
peaks. 
And they will go up forever, but 

157
00:14:31,242 --> 00:14:35,683
generally go down. 
In some places it gets rather hard to see 

158
00:14:35,683 --> 00:14:41,132
that peak structure and that's because 
the energy is getting rather low. 

159
00:14:41,132 --> 00:14:47,630
And furthermore, that departs from the
shape of the transfer function a little 

160
00:14:47,630 --> 00:14:53,887
bit, and that's because the sinc function
character of the Fourier coefficients is

161
00:14:53,887 --> 00:14:58,272
reducing it. 
And so you can see that the amplitude of 

162
00:14:58,272 --> 00:15:03,987
these pitch lines is multiplied by 
something that does look like the 

163
00:15:03,987 --> 00:15:08,932
transfer function for the letter 'O' and 
so now we can see

164
00:15:08,932 --> 00:15:13,595
what's going on.
We should be able to look at a spectrum, 

165
00:15:13,595 --> 00:15:19,327
and see what the pitch is, and get a 
rough idea of what the formant structure

166
00:15:19,327 --> 00:15:22,796
is. 
At least saying generally where's F1: is

167
00:15:22,796 --> 00:15:27,357
it high or low? Is F2 really high
or really low?

168
00:15:27,357 --> 00:15:32,382
And I think you can see that fairly 
easily from this plot.

169
00:15:32,382 --> 00:15:39,007
That is why, in the frequency domain,
the structure of speech is much

170
00:15:39,007 --> 00:15:42,932
more apparent than it is in the time
domain. 

171
00:15:42,932 --> 00:15:47,282
All right. 
So, here is what we call the Speech 

172
00:15:47,282 --> 00:15:51,752
Spectrogram. 
And this is a special plot that plots 

173
00:15:51,752 --> 00:16:00,122
frequency on the vertical axis, and what
it's plotting at each moment in time is 

174
00:16:00,122 --> 00:16:07,360
the spectrum as a heat map. 
So, there's a spectrum corresponding to 

175
00:16:07,360 --> 00:16:15,632
each column of the image here and the 
amplitude of the spectrum is encoded as a 

176
00:16:15,632 --> 00:16:19,597
color. 
So the very deep red corresponds to the

177
00:16:19,597 --> 00:16:24,577
highest amplitudes. 
The yellowish are smaller than that. 

178
00:16:24,577 --> 00:16:30,827
The greens are even smaller than that, 
and the blues are the smallest areas.

179
00:16:30,827 --> 00:16:37,387
Essentially the temperature reflects 
amplitude, and that's why it's called a 

180
00:16:37,387 --> 00:16:40,232
heat map of course. 
All right, so

181
00:16:40,232 --> 00:16:46,976
what we see below is the time-domain
plot of the speech signal.

182
00:16:46,976 --> 00:16:53,929
And that's me saying Rice University. 
And, then there are some segments where 

183
00:16:53,929 --> 00:16:59,630
you can roughly make out the periodic 
nature of this speech signal. 

184
00:16:59,630 --> 00:17:05,139
The scale here is highly compressed,
going over 1.2 seconds.

185
00:17:05,139 --> 00:17:11,476
So it is very hard to see. 
What's much clearer is what's going on in

186
00:17:11,476 --> 00:17:17,001
the frequency domain. 
Makes everything much clearer to me. 
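
To make the idea of a spectrogram concrete, here is a minimal Python sketch that chops a signal into short frames and computes one magnitude spectrum per frame, one column of the image per frame. Everything in it is an assumption for illustration: the 8000 Hz sampling rate, the 128-sample frames, the Hann window, and the synthetic test tone that jumps from 1000 Hz to 2000 Hz. It uses a direct DFT for clarity; a real system would use an FFT.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of the DFT of one frame, for bins 0 .. N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len, hop):
    """List of spectra, one per frame; each entry is one column of the heat map."""
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce leakage between frequency bins.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1)))
            for i, x in enumerate(frame)
        ]
        columns.append(dft_magnitudes(windowed))
    return columns


fs = 8000          # assumed sampling rate in Hz
frame_len = 128    # assumed frame length in samples
# Synthetic test signal: 1000 Hz tone followed by a 2000 Hz tone.
signal = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]
signal += [math.sin(2 * math.pi * 2000 * n / fs) for n in range(512)]

cols = spectrogram(signal, frame_len, hop=frame_len)
# Frequency of the strongest bin in each column.
peak_hz = [max(range(len(c)), key=c.__getitem__) * fs / frame_len for c in cols]
print(peak_hz)
```

Reading the strongest frequency column by column shows the line moving up over time, which is the kind of change that is hard to spot in the compressed time-domain plot.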

187
00:17:17,001 --> 00:17:22,042
So the first thing I want to point out 
are these lines

188
00:17:22,042 --> 00:17:27,735
across here, and we now know those are the
pitch lines. 

189
00:17:27,735 --> 00:17:34,501
That corresponds to the pitch. 
They're harmonics of the pitch. 

190
00:17:34,501 --> 00:17:38,629
Over here. 
Looks to me like the pitch is going up. 

191
00:17:38,629 --> 00:17:44,199
All right, because the lines are
moving up to higher frequencies. 

192
00:17:44,199 --> 00:17:49,931
So that's me saying, Rice, Rice.
Here it looks like they're going down a 

193
00:17:49,931 --> 00:17:54,096
little bit. 
Over here, they're relatively constant. 

194
00:17:54,096 --> 00:17:58,882
So you can get an idea of how this 
speaker is saying the words. 

195
00:17:58,882 --> 00:18:05,115
By looking at those pitch contours. 
Now, what's more important to 

196
00:18:05,115 --> 00:18:12,080
understanding, at least in English, what
is being said are the formants.

197
00:18:12,080 --> 00:18:16,621
So, I see here, that's probably F1.
Here is F2. 

198
00:18:16,621 --> 00:18:23,408
So F1 is fairly constant, F2 is moving
up, and this could be F3, and then over

199
00:18:23,408 --> 00:18:28,741
here, it could be F4.
Here you can see, F2 really moving up. 

200
00:18:28,741 --> 00:18:35,378
F1 maybe moving down a little bit and 
here's F2 really moving up and F1 moving 

201
00:18:35,378 --> 00:18:40,722
down, when I'm saying T. 
And so we can get a very detailed idea of 

202
00:18:40,722 --> 00:18:46,972
what's going on in this speech signal. 
In fact there are experts out there who,

203
00:18:46,972 --> 00:18:53,042
if you give them a spectrogram, and tell 
them the language that is being

204
00:18:53,042 --> 00:18:58,932
spoken, then they can tell you what's 
being said and whether it's a male or 

205
00:18:58,932 --> 00:19:04,589
female that's saying it. 
Yet because I know it's me, I have some

206
00:19:04,589 --> 00:19:10,545
idea of scale and I can tell you the 
pitch is much more indicative of a male 

207
00:19:10,545 --> 00:19:14,785
than a female. 
Females have a higher range of pitch 

208
00:19:14,785 --> 00:19:19,490
frequencies than males. 
This one is so low, about 100 Hertz, 

209
00:19:19,490 --> 00:19:24,753
that, that probably is a male. 
If it goes much higher, higher than about 

210
00:19:24,753 --> 00:19:29,941
120 Hertz, then that's probably a female 
and children are even higher. 

211
00:19:29,941 --> 00:19:35,771
Now, I want to point out some things, and 
that is, there are some aspects of this 

212
00:19:35,771 --> 00:19:43,375
that don't fit the speech model
and don't have pitch lines, and that's 

213
00:19:43,375 --> 00:19:50,695
corresponding to these areas. 
And that's when you're saying '[SOUND]' 

214
00:19:50,695 --> 00:19:56,703
[SOUND], and that corresponds to what we 
call fricatives.

215
00:19:56,703 --> 00:20:03,282
And these are words in which the vocal 
cords open, the

216
00:20:03,282 --> 00:20:09,653
air pressure from the lungs just hits the
vocal tract, and then by closing

217
00:20:09,653 --> 00:20:14,477
the teeth or the lips, you create
essentially a turbulent noise.

218
00:20:14,477 --> 00:20:20,274
It doesn't have much structure in the 
time domain, but in the frequency domain 

219
00:20:20,274 --> 00:20:29,042
it occurs at high frequencies; in this
plot it is occurring higher than any of

220
00:20:29,042 --> 00:20:34,129
the formants.
And that is pretty typical. 

221
00:20:34,129 --> 00:20:43,972
It's very interesting that in the old 
days, in the days of analog telephones, 

222
00:20:43,972 --> 00:20:51,886
they would lowpass filter the
speech signal with a cutoff frequency of 

223
00:20:51,886 --> 00:20:57,840
about 3,300 hertz, cutting out everything 
above that. 

224
00:20:57,840 --> 00:21:04,897
Now it turns out that is high enough that
it doesn't really affect the formant

225
00:21:04,897 --> 00:21:09,787
structure all that much. 
At least, in terms of understanding 

226
00:21:09,787 --> 00:21:16,192
what's being said, but for the fricative 
signals it definitely makes it very hard.

227
00:21:16,192 --> 00:21:22,265
I know in the old days, it was very hard 
to tell words apart that differ only

228
00:21:22,265 --> 00:21:26,569
in their fricative. 
Fix and six were incredibly difficult to 

229
00:21:26,569 --> 00:21:32,179
tell apart over old telephone systems. 
And in fact, I suggest you try it, try it 

230
00:21:32,179 --> 00:21:35,639
with a friend. 
Call them up and say these words in 

231
00:21:35,639 --> 00:21:39,223
isolation. 
And see if your friend can understand 

232
00:21:39,223 --> 00:21:42,772
what you're saying. 
Now, of course, if you put these in a 

233
00:21:42,772 --> 00:21:47,155
sentence you can easily tell them apart. 
You can say, I fixed my car.

234
00:21:47,155 --> 00:21:50,836
You certainly wouldn't
say I sixed my car. 

235
00:21:50,836 --> 00:21:54,826
It doesn't make any sense. 
So if you say them out of context, in 

236
00:21:54,826 --> 00:21:57,188
isolation, one word at a time,

237
00:21:57,188 --> 00:22:01,614
I think you're going to find that in some
cases you're going to have a hard time 

238
00:22:01,614 --> 00:22:05,082
telling them apart. 
If your phone system has a much higher 

239
00:22:05,082 --> 00:22:10,394
cutoff frequency, then those get through. 
So you can judge what's going on in the 

240
00:22:10,394 --> 00:22:14,902
phone system by playing a little
word recognition game. 

241
00:22:14,902 --> 00:22:20,076
Now, there's something else I want to 
point out here, and that is the highest 

242
00:22:20,076 --> 00:22:23,787
frequency in this plot is about 5,500 
hertz. 

243
00:22:23,787 --> 00:22:29,432
And we'll go into why that is when we 
talk about digital signal processing. 

244
00:22:29,432 --> 00:22:35,577
It turns out there are some signals,
speech signals, that can have higher

245
00:22:35,577 --> 00:22:39,909
frequency content. 
And we usually say that the bandwidth is

246
00:22:39,909 --> 00:22:43,739
about 6-7 kHz. 
It's very approximate, because you are 

247
00:22:43,739 --> 00:22:49,145
seeing that the transfer function for the
vowels is going down as frequency goes

248
00:22:49,145 --> 00:22:54,112
up generally, beyond about 4-5 kHz. 
It's definitely going down, so the 

249
00:22:54,112 --> 00:22:58,435
bandwidth is only a rough measure of 
what's going on. 

250
00:22:58,435 --> 00:23:03,619
So, we now can see a lot of things about 
the structure of speech. 

251
00:23:03,619 --> 00:23:09,606
Now I claim, if you look at the 
spectrogram, there's something not speech

252
00:23:09,606 --> 00:23:15,058
like in the spectrogram.
There's something here that isn't right. 

253
00:23:15,058 --> 00:23:20,943
And in particular,
I want to point out this line that's going

254
00:23:20,943 --> 00:23:26,224
across here. 
What in the world does that correspond 

255
00:23:26,224 --> 00:23:34,162
to? There are two answers: one is what
kind of signal would produce that line,

256
00:23:34,162 --> 00:23:38,742
and the other is what's physically producing that
line. 

257
00:23:38,742 --> 00:23:47,319
Okay, now, what that looks
like, in terms of signals, it turns out

258
00:23:47,319 --> 00:23:55,296
it looks like a sine wave, with a
frequency of about, oh, 1800 hertz.

259
00:23:55,296 --> 00:23:59,180
Well that's pretty high. 
It's certainly not power frequency and it 

260
00:23:59,180 --> 00:24:01,773
turns out that's fan noise from my 
computer. 

261
00:24:01,773 --> 00:24:06,141
I couldn't get my computer to turn its 
fan off when I made this recording and 

262
00:24:06,141 --> 00:24:09,641
you're picking that up. 
And you can even see there's another 

263
00:24:09,641 --> 00:24:12,131
piece of it up there at a higher 
frequency. 

264
00:24:12,131 --> 00:24:15,697
Okay. 
So, we now know what the structure of 

265
00:24:15,697 --> 00:24:19,958
speech is. 
First of all, it has a bandwidth of about

266
00:24:19,958 --> 00:24:25,033
6 or 7 kilohertz and that's important in 
many applications. 

267
00:24:25,033 --> 00:24:31,562
The structure of speech is best seen in
the frequency domain. There are some time

268
00:24:31,562 --> 00:24:36,442
domain things you can pick out, like what
the pitch period is; that could be pretty

269
00:24:36,442 --> 00:24:38,862
easy. 
But it's hard to see how pitch is 

270
00:24:38,862 --> 00:24:43,952
changing very easily with time until you 
look at the spectrogram, you look in the 

271
00:24:43,952 --> 00:24:49,371
frequency domain. 
I want to point out that many other

272
00:24:49,371 --> 00:24:59,582
signals share the same frequency range as
speech, and signal processing systems,

273
00:24:59,582 --> 00:25:05,831
they exploit in great detail this special
structure of speech, so that they can 

274
00:25:05,831 --> 00:25:11,649
send speech signals efficiently. 
And these are usually called vocoders, 

275
00:25:11,649 --> 00:25:19,157
and that is short for voice coder.
They typically take speech and, the way I

276
00:25:19,157 --> 00:25:24,077
like to put it,
rip apart the speech signal into the

277
00:25:24,077 --> 00:25:30,612
pitch part and the vocal tract part.
They actually try to split that signal up 

278
00:25:30,612 --> 00:25:34,832
into those 2 pieces. 
That voice coder is embedded into cell 

279
00:25:34,832 --> 00:25:40,792
phones, and in particular, they
use very efficient ways of

280
00:25:40,792 --> 00:25:45,934
communicating what the vocal tract is,
more so than just sending the raw

281
00:25:45,934 --> 00:25:50,593
speech signal, as you would
normally do if you just had some regular

282
00:25:50,593 --> 00:25:54,955
six or seven kilohertz signal.
Now it turns out voice coders, vocoders, have

283
00:25:54,955 --> 00:25:59,147
slipped into the music arena. 
And I'm sure if you've heard modern 

284
00:25:59,147 --> 00:26:02,852
hip-hop, what they're doing is exactly 
what a vocoder does.

285
00:26:02,852 --> 00:26:08,107
They're ripping apart the signal into the 
vocal tract part and the pitch part.

286
00:26:08,107 --> 00:26:11,322
They're playing with the pitch part a 
little bit. 

287
00:26:11,322 --> 00:26:15,752
Distorting it from being a pure set of 
pulses, and then putting it through a 

288
00:26:15,752 --> 00:26:18,927
filter that looks like the vocal tract 
they measured. 

289
00:26:18,927 --> 00:26:22,577
And it turns out the music sounds, at 
least to my ears, weird. 

290
00:26:22,577 --> 00:26:27,777
But that's how voice coders work.
And by the way, modern computer

291
00:26:27,777 --> 00:26:33,005
technology and signal processing systems 
are such that you can do this in real 

292
00:26:33,005 --> 00:26:36,024
time. 
You can actually do this while the singer 

293
00:26:36,024 --> 00:26:41,277
is singing, it's kind of interesting. 
But for our purposes what they're doing 

294
00:26:41,277 --> 00:26:45,716
and what's important is they're 
exploiting this special structure of 

295
00:26:45,716 --> 00:26:48,831
speech. 
And you will learn that the more you know 

296
00:26:48,831 --> 00:26:52,407
about signals,

297
00:26:52,407 --> 00:26:54,405
the more effectively and efficiently you can
communicate those kinds of signals.