In this video, we're going to talk about 
the Speech Signal and try to understand 
its structure. 
That structure is exploited in all kinds 
of systems, including modern telephone 
systems and cell phone communication 
systems. 
It's really important to understand the 
structure. 
It's one thing to have a general 
characterization like, what bandwidth, 
what kind of frequencies does the speech 
signal occupy. 
But there's more structure beyond that, 
that's going to be very important in 
designing efficient communications 
systems for speech. 
So all the aspects of the speech 
signal are determined by how it's 
generated. 
And we're going to develop a model for 
that, one that's based on linear signals 
and systems, 
things we already know. 
And, this is going to lead us to look at 
the spectral structure of speech. 
Turns out in the time domain you can only 
gain so much information from examining a 
speech signal. 
But if you look at its spectrum, you look 
at its Fourier transform, then all of a 
sudden a lot of information pops out 
that's very, very useful in understanding 
what is going on and what is being said. 
So, how is speech generated? Here's my 
rather crude drawing 
of the speech production system. 
Everything begins with the lungs, and the 
lungs are providing a source of air 
pressure. The important part for us 
in understanding speech is the vocal 
cords. 
Now, seen from the top, the vocal 
cords look 
something like this, 
where these are tendon-like things. 
And in between, there's a slit that's 
open. 
When you breathe, the vocal cords, 
this part, are not under any tension. 
They're loose, the slit opens up, and you 
breathe normally. 
When you want to say something, what 
happens is that these come under 
tension. 
They are pulled, 
and that closes up this slit. 
All right? In fact, they become 
completely closed, and it's not until the 
air pressure from the lungs builds up 
enough that it forces this slit to open 
momentarily, releasing a puff of air, 
which goes into the vocal tract. 
The vocal cords close. 
Air pressure builds up again, it's released 
rather quickly, giving another puff of 
air, and that's what I've tried to 
indicate there. 
So if you were to plot as a function of 
time, the air pressure just above the 
vocal cords, you would get nothing for a 
while because the pressure below is 
building up and finally it releases, 
releasing that puff of air and it comes 
back down. 
And then the pressure builds up from the 
lungs again, and it pops up again, and 
releases. 
And roughly, it produces a periodic 
pulse-like sequence from the vocal cords. 
And this period is known as the pitch 
period. 
And we're going to talk more about that 
in just a second. 
Well, now what happens is that these 
puffs of air 
go into what's called the vocal 
tract, which is formed by all of these 
structures: the tongue, the lips, the 
opening of your mouth, all kinds of 
things. 
And if I were to draw this again, we 
might have something that looks like 
this. 
It opens up 
and kind of has some teeth here at the end. 
And what's happening is that those puffs 
of air are coming in there. 
Acoustically, what your mouth looks 
like is a pipe. 
If I were to straighten this 
out, it would look like a pipe that has a 
cross-sectional area that varies. 
So, this is the length of the pipe and 
this is its cross section. 
Acoustically, your mouth looks like 
a straight pipe that has a complicated 
structure. 
Well, what happens now is the air 
pressure signal comes in here and, the 
word is, it excites the 
air pressure inside the vocal tract. 
It turns out we can describe 
this as a linear filter having peaks 
that are called resonances, and this 
functions in many ways like an organ 
pipe. 
An organ pipe is a straight tube and has 
a very simple resonant structure. 
Think of the way a note sounds on an organ. 
As you push the key, what happens? Air is 
forced into one of those pipes and it 
gives you this nice, not quite pure tone, 
but a resonance structure having harmonics 
of some fundamental. 
Because of the variable cross-sectional 
area here, it turns out the resonance 
structure becomes more complicated. 
And that is what we are going to see in 
the next slide. 
So let's summarize what the speech model 
is, at least in terms of a system- 
theoretic model. 
So, we have the lungs. 
And what they're producing is actually a 
pretty boring signal. 
It's basically a constant. 
The vocal cords will either be open while 
you're breathing, or 
they will be put under tension when 
you're trying to speak, and that means 
there's some neural control coming from 
the brain that's controlling that. 
The case where you're speaking 
is the only thing we care about, and that 
produces what we're going to describe as 
a periodic pulse train, a periodic pulse 
sequence, 
something we've already talked about. 
That serves as the input to the vocal 
tract. According to what you want to 
say, 
the position of the tongue, the lips, 
everything else is again under neural 
control, and that determines what is being 
said. 
And the result is the speech signal that 
you pick up with a microphone. 
Well let's examine this in more detail to 
figure out important characteristics of 
these signals. 
So here's that model again, and I want to 
start with the 
input here, the periodic pitch signal, 
which is a periodic pulse train. We've 
already talked about this many times. 
It consists of a set of pulses, 
quite narrow, and T is what's known as the 
pitch period. 
But what people normally refer to as the 
pitch is the pitch frequency, which is 1/T, 
and in the speech world that's called F0, 
and that's the pitch. 
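As a rough sketch of this input signal, here is a short Python example that builds a rectangular periodic pulse train and relates the pitch period T to the pitch frequency F0 = 1/T. The sample rate, the pulse width, and the function name are assumptions chosen for illustration, not values from the lecture.

```python
def pulse_train(pitch_hz, pulse_width_s, duration_s, sample_rate_hz):
    """Return samples of a rectangular periodic pulse train (1.0 in a pulse, else 0.0)."""
    period = round(sample_rate_hz / pitch_hz)                # pitch period T, in samples
    width = max(1, round(pulse_width_s * sample_rate_hz))    # pulse width, in samples
    n_samples = round(duration_s * sample_rate_hz)
    return [1.0 if (n % period) < width else 0.0 for n in range(n_samples)]


SAMPLE_RATE = 8000   # Hz, assumed for illustration
F0 = 100.0           # pitch frequency in Hz
T = 1.0 / F0         # pitch period in seconds (10 ms)

# 50 ms of narrow pulses (1 ms wide, an assumed width), one pulse per pitch period.
glottal = pulse_train(F0, 0.001, 0.05, SAMPLE_RATE)

# Indices where a new pulse begins; consecutive onsets are one pitch period apart.
onsets = [n for n in range(len(glottal)) if glottal[n] == 1.0 and (n == 0 or glottal[n - 1] == 0.0)]
```

Raising `F0` shortens the spacing between onsets, which is the discrete-time picture of tightening the vocal cords.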
So as you change the tension of your 
vocal cords, you can make your pitch go up, 
or go down if you loosen them, 
and that's all again under 
neural control. 
Now, that signal serves as the input 
into what we're going to describe as a 
linear filter, 
which has a transfer function that will 
change according to the vowel or the 
speech sound you're trying to make. 
So 
I have here two vowel sounds that 
I chose, the O and the E, and 
what I'm plotting here is the transfer 
function of the vocal tract. 
So an O sound has a series of peaks, 
these are the resonances I was referring 
to. 
And so does the E sound. 
It has a series of peaks, but you can 
see, they have a very different 
structure, and, let's look at that 
structure in more detail. 
In the speech world, each of these peaks 
is called a formant, and the formants are 
identified by their 
frequencies, which are just numbered 
sequentially. 
So the frequency of the 
peak at the lowest frequency 
is called F1, the next higher up is 
called F2, the next one's F3, then 
F4. 
So they're all just numbered sequentially 
according to the order 
they appear from low to high in the 
speech spectrum. 
So the O has a moderately high F1, a 
very low F2, and F3 is shifted up, making a 
nice valley. 
And then there's F4 and F5. 
For the E sound, F1 is lower than it is 
for the O. 
F2 is all the way up here, so that 
resonance structure has really changed a 
lot. 
F3 is a little higher than it was for the 
O and then F4 and F5 are roughly at the 
same place. 
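To make the idea of a transfer function with formant peaks concrete, here is a minimal sketch that models the vocal tract as a cascade of two-pole resonators, one per formant. The formant frequencies, the bandwidth, and the sample rate below are illustrative assumptions, not values read off the lecture's plots, and the cascade-of-resonators model is one common simplification rather than the lecture's own method.

```python
import cmath
import math


def resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz):
    """Magnitude response at freq_hz of one two-pole resonator centered at center_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)   # pole radius
    theta = 2 * math.pi * center_hz / sample_rate_hz         # pole angle
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    denom = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2
    return 1.0 / abs(denom)


def vocal_tract_gain(freq_hz, formants_hz, bandwidth_hz=100.0, sample_rate_hz=11000.0):
    """Magnitude response of a cascade of resonators, one per formant."""
    gain = 1.0
    for center_hz in formants_hz:
        gain *= resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz)
    return gain


# Hypothetical formant frequencies (F1, F2, F3) in Hz, chosen only so that
# the O has a low F2 and the E has a lower F1 and a much higher F2.
OH_FORMANTS = (500.0, 900.0, 2500.0)
EE_FORMANTS = (300.0, 2300.0, 3000.0)

oh_at_900 = vocal_tract_gain(900.0, OH_FORMANTS)
ee_at_900 = vocal_tract_gain(900.0, EE_FORMANTS)
oh_at_2300 = vocal_tract_gain(2300.0, OH_FORMANTS)
ee_at_2300 = vocal_tract_gain(2300.0, EE_FORMANTS)
```

Evaluating `vocal_tract_gain` over a grid of frequencies gives a curve with one peak per formant, and comparing the two vowels at a single frequency shows how far the resonance structure moves.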
So. 
You can actually look at the spectra and 
figure out what the vowel is that's being 
said. 
Now if you look in the time domain, 
things are not quite so clear. 
Here is a segment of speech corresponding 
to me saying O, and a segment of 
speech corresponding to me saying E. 
Now what you should recall is back when 
we talked about sending this periodic 
pulse sequence through our RC lowpass 
filter. 
What we got out was again something that 
was periodic, and kind of looked like this. 
And so we should expect something similar 
here. 
We have the same pulse sequence going 
into a linear filter. 
It's a bit more complicated than a simple 
RC lowpass, but again we should see a 
periodic output. I used this example at 
the beginning of the course when we 
talked about signals, and 
the period should be readily 
evident. 
I'd like you to tell me what the period 
is for the vowel 
E in this case, okay? I get a period of 
roughly 10 milliseconds. This scale is in 
seconds. 
The separation between these peaks is about 
10 milliseconds, which makes the pitch 
100 
hertz. 
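In the lecture the period is read off the plot by eye. As a hedged alternative, here is a small sketch that estimates the period automatically by finding the lag at which a signal best matches a shifted copy of itself (autocorrelation). This is a swapped-in technique, not the method used above, and the test signal is a synthetic stand-in for a vowel, not the recorded speech.

```python
import math


def estimate_pitch_hz(samples, sample_rate_hz, min_hz=50.0, max_hz=400.0):
    """Return sample_rate / lag for the lag with the largest autocorrelation."""
    min_lag = int(sample_rate_hz / max_hz)   # shortest period considered
    max_lag = int(sample_rate_hz / min_hz)   # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag


SAMPLE_RATE = 8000  # Hz, assumed

# Synthetic periodic signal: three harmonics of 100 Hz (period 10 ms), 0.1 s long.
vowel = [
    sum(a * math.sin(2 * math.pi * 100 * h * n / SAMPLE_RATE)
        for h, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
    for n in range(800)
]

pitch = estimate_pitch_hz(vowel, SAMPLE_RATE)   # 10 ms period, so 100 Hz
```

The search range of 50 to 400 Hz is an assumption; real speech would also need some care with noise and with unvoiced segments that have no clear period.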
Now, another question for you. 
Let me clear the slide so we can see 
things. 
What is the pitch here, and in 
particular, is it higher or lower than 
it is for the E? And when I talk about 
pitch, we usually worry about 
the frequency, not the interval, but the 
frequency. 
Is the pitch frequency lower or higher 
for the O? Okay? I think it's pretty 
evident that the pitch frequency is 
lower, 
because there's only one 
period there, and I see at least two there 
in a shorter period of time. 
So the pitch is lowered a little bit 
below 100 hertz. 
So, what should we expect the spectrum 
of the speech to look like? Well, as we 
know, the spectrum is equal to the 
transfer function times the Fourier 
transform, 
in this case the Fourier series, of the 
pitch signal. 
So you should expect the transfer 
function, H of f, which is one of these, 
to be multiplied by the Fourier series 
for this, which consists of a set of 
harmonics. 
They're all harmonics of 
F0. 
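The product described above can be sketched numerically. The example below computes the Fourier series coefficient magnitudes of a rectangular pulse train, which follow a sinc shape, and multiplies each harmonic by the gain of a transfer function at that harmonic's frequency. The single second-order resonance used for H(f), its center frequency and Q, and the pulse width are all assumptions for illustration; a real vocal tract has several formants.

```python
import math


def pulse_train_coefficient(k, pulse_width_s, period_s):
    """Magnitude of the k-th Fourier series coefficient of a unit-height rectangular pulse train."""
    duty = pulse_width_s / period_s
    if k == 0:
        return duty
    x = math.pi * k * duty
    return abs(duty * math.sin(x) / x)   # sinc-shaped decay with harmonic number


def resonance_gain(freq_hz, center_hz, q):
    """Magnitude of a standard second-order resonant response (one hypothetical formant)."""
    ratio = freq_hz / center_hz
    return 1.0 / math.sqrt((1 - ratio ** 2) ** 2 + (ratio / q) ** 2)


F0 = 100.0                  # pitch frequency in Hz
PULSE_WIDTH = 0.001         # seconds, assumed
FORMANT_HZ, Q = 500.0, 5.0  # assumed single formant

# One spectral line per harmonic k * F0: source coefficient times filter gain.
lines = []
for k in range(1, 21):
    freq = k * F0
    amplitude = pulse_train_coefficient(k, PULSE_WIDTH, 1.0 / F0) * resonance_gain(freq, FORMANT_HZ, Q)
    lines.append((freq, amplitude))

strongest = max(lines, key=lambda line: line[1])   # the line nearest the formant peak
```

With these assumed numbers the strongest line sits at the harmonic on the formant, and the tenth harmonic nearly vanishes because it falls on a zero of the sinc envelope.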
So our model for speech is a little 
bit simpler than what I've shown. 
We just worry about 
what the pitch signal is, basically 
what its period is, what its pitch 
frequency is. 
And then we worry about what the 
transfer function is, and that is how we 
describe the speech signal. 
And let me show you what an actual speech 
spectrum looks like and now it makes a 
little bit more sense what we're seeing. 
So you can see these peaks. The 
spectrum of our periodic pulse train is 
going to be a set of lines, decaying like 
a sinc function, 
at the harmonics of F0. 
So that would be at F0, 
2F0, 3F0, et cetera. 
And that corresponds to each of these 
peaks. 
They go on forever, but 
their amplitudes generally go down. 
In some places it gets rather hard to see 
that peak structure, and that's because 
the energy is getting rather low. 
Furthermore, the envelope departs from the 
shape of the transfer function a little 
bit, and that's because the sinc function 
character of the Fourier coefficients is 
reducing it. 
And so you can see that the amplitude of 
these pitch lines is multiplied by 
something that does look like the 
transfer function for the letter 'O', and 
so now we can see 
what's going on. 
We should be able to look at a spectrum 
and see what the pitch is, and get a 
rough idea of what the formant structure 
is, 
at least saying generally where F1 is. Is 
it high or low? Is F2 really high 
or really low? 
And I think you can see that fairly 
easily from this plot. 
That is why in the frequency domain 
the structure of speech is much 
more apparent than it is in the time 
domain. 
All right. 
So, here is what we call the Speech 
Spectrogram. 
And this is a special plot that plots 
Frequency on the vertical axis and what 
it's plotting at each moment in time is 
the spectrum as a heat map. 
So, there's a spectrum corresponding to 
each column of the image here and the 
amplitude of the spectrum is encoded as a 
color. 
So the very deep red corresponds to the 
highest amplitudes. 
The yellowish are smaller than that. 
The greens are even smaller than that, 
and the blues are the smallest values. 
Essentially the temperature reflects 
amplitude, and that's why it's called a 
heat map, of course. 
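A spectrogram of this kind can be sketched as follows: chop the signal into short overlapping frames, take the magnitude spectrum of each frame, and stack the spectra as columns. The frame length, hop size, Hann window, and the two-tone test signal are assumptions for illustration; the lecture does not say how its spectrogram was computed, and a plain DFT is used here only to keep the example free of external libraries.

```python
import cmath
import math


def spectrogram(samples, frame_len, hop):
    """Return a list of columns, each the magnitude spectrum of one Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT for bins 0 .. frame_len / 2 (the non-negative frequencies).
        column = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ]
        columns.append(column)
    return columns


SAMPLE_RATE = 8000   # Hz, assumed
FRAME_LEN = 128      # samples per frame, so bins are 62.5 Hz apart
HOP = 64             # samples between frame starts

# Test signal: 0.1 s of a 500 Hz tone followed by 0.1 s of a 1500 Hz tone.
signal = [
    math.sin(2 * math.pi * (500 if n < 800 else 1500) * n / SAMPLE_RATE)
    for n in range(1600)
]

columns = spectrogram(signal, FRAME_LEN, HOP)
peak_bins = [column.index(max(column)) for column in columns]
peak_hz = [bin_index * SAMPLE_RATE / FRAME_LEN for bin_index in peak_bins]
```

Mapping each magnitude in `columns` to a color, with time along the horizontal axis and bin index along the vertical axis, gives the heat-map picture described above.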
All right, so. 
What we see below is the time-domain 
plot of the speech signal. 
And that's me saying Rice University. 
And then there are some segments where 
you can roughly make out the periodic 
nature of this speech signal. 
The scale here is highly compressed, 
going over 1.2 seconds, 
so it is very hard to see. 
What's much clearer is what's going on in 
the frequency domain. 
It makes everything much clearer to me. 
So the first thing I want to point out 
are these lines 
across here, and we now know those are the 
pitch lines. 
They correspond to the pitch. 
They're harmonics of the pitch. 
Over here, 
it looks to me like the pitch is going up, 
all right, because the lines are 
moving up to higher frequencies. 
So that's me saying "Rice." 
Here it looks like they're going down a 
little bit. 
Over here, they're relatively constant. 
So you can get an idea of how this 
speaker is saying the words 
by looking at those pitch contours. 
Now, what's more important to 
understanding, at least in English, what 
is being said are the formants. 
So I see here what's probably F1. 
Here is F2. 
So F1 stays constant, F2 is moving 
up, and this could be F3, and then over 
here, it could be F4. 
Here you can see F2 really moving up, 
F1 maybe moving down a little bit, and 
here's F2 really moving up and F1 moving 
down, when I'm saying T. 
And so we can get a very detailed idea of 
what's going on in this speech signal. 
In fact, there are experts out there who, 
if you give them a spectrogram and tell 
them what language is being 
spoken, can tell you what's 
being said and whether it's a male or 
female that's saying it. 
Because I know it's me, I have some 
idea of scale, and I can tell you the 
pitch is much more indicative of a male 
than a female. 
Females have a higher range of pitch 
frequencies than males. 
This one is so low, about 100 Hertz, 
that, that probably is a male. 
If it goes much higher, higher than about 
120 Hertz, then that's probably a female 
and children are even higher. 
Now, I want to point out some things, and 
that is, there are some aspects of this 
that don't fit the speech model 
and don't have pitch lines, and they 
correspond to these areas. 
And that's when you're saying '[SOUND]' 
[SOUND], and that corresponds to what we 
call fricatives. 
And these are sounds in which the vocal 
cords are open, the 
air pressure from the lungs just hits the 
vocal tract, and then by closing 
the teeth or the lips, you create 
essentially a turbulent noise. 
It doesn't have much structure in the 
time domain, but in the frequency domain 
it occurs at high frequencies. In this 
plot, it is occurring higher than any of 
the formants. 
And that is pretty typical. 
It's very interesting that in the old 
days, in the days of analog telephones, 
they would lowpass filter the 
speech signal with a cutoff frequency of 
about 3,300 hertz, cutting out everything 
above that. 
Now it turns out that is high enough that 
it doesn't really affect the formant 
structure all that much, 
at least in terms of understanding 
what's being said, but for the fricative 
signals it definitely makes it very hard. 
I know in the old days, it was very hard 
to tell words apart that differ only 
in their fricative. 
Fix and six were incredibly difficult to 
tell apart over old telephone systems. 
And in fact, I suggest you try it, try it 
with a friend. 
Call them up and say these words in 
isolation. 
And see if your friend can understand 
what you're saying. 
Now, of course, if you put these in a 
sentence you can easily tell them apart. 
You can say, "I fixed my car." 
You certainly wouldn't 
say, "I sixed my car." 
It doesn't make any sense. 
So if you say them out of context, in 
isolation, one word at a time, 
I think you're going to find that in some 
cases you're going to have a hard time 
telling them apart. 
If your phone system has a much higher 
cutoff frequency, then those get through. 
So you can judge what's going on in the 
phone system by playing a little 
word recognition game. 
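To see why a 3,300 hertz cutoff leaves the formants mostly alone but removes fricative energy, here is a minimal sketch of a lowpass filter and its gain at two frequencies. The windowed-sinc FIR design, the number of taps, the 16 kHz sample rate, and the two probe frequencies are assumptions for illustration; the old analog telephone filters were not built this way.

```python
import cmath
import math


def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc FIR lowpass (Hamming window), normalized to unit gain at 0 Hz."""
    fc = cutoff_hz / sample_rate_hz     # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        m = n - mid
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def gain_at(taps, freq_hz, sample_rate_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    return abs(sum(t * cmath.exp(-1j * w * n) for n, t in enumerate(taps)))


SAMPLE_RATE = 16000   # Hz, assumed
taps = lowpass_taps(3300.0, SAMPLE_RATE)

formant_region_gain = gain_at(taps, 1000.0, SAMPLE_RATE)     # passes almost unchanged
fricative_region_gain = gain_at(taps, 5000.0, SAMPLE_RATE)   # strongly attenuated
```

Raising the cutoff argument toward 6 or 7 kHz lets the 5 kHz component through, which mirrors the point about phone systems with a higher cutoff frequency.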
Now, there's something else I want to 
point out here, and that is the highest 
frequency in this plot is about 5,500 
hertz. 
And we'll go into why that is when we 
talk about digital signal processing. 
It turns out there are some 
speech signals that can have higher 
frequency content, 
and we usually say that the bandwidth is 
about 6-7 kHz. 
It's very approximate, because you can 
see the transfer function for the 
vowels is going down as frequency goes 
up, generally. Beyond about 4-5 kHz 
it's definitely going down, so the 
bandwidth is only a rough measure of 
what's going on. 
So, we now can see a lot of things about 
the structure of speech. 
Now I claim, if you look at the 
Spectrogram, there's something not speech 
like in the Spectrogram. 
There's something here that isn't right. 
In particular, 
I want to point out this line that's going 
across here. 
What in the world does that correspond 
to? There are two questions: one is what 
kind of signal would produce that line, 
and the other is what's physically 
producing that line. 
Okay. Now, what that looks 
like in terms of signals, it turns out, 
is a sine wave with a 
frequency of about 1800 hertz. 
Well that's pretty high. 
It's certainly not power frequency and it 
turns out that's fan noise from my 
computer. 
I couldn't get my computer to turn its 
fan off when I made this recording and 
you're picking that up. 
And you can even see there's another 
piece of it up there at a higher 
frequency. 
Okay. 
So, we now know what the structure of 
speech is. 
First of all, it has a bandwidth of about 
6 or 7 kilohertz, and that's important in 
many applications. 
The structure of speech is best seen in 
the frequency domain. There are some time- 
domain things you can pick out, like what 
the pitch period is; that can be pretty 
easy. 
But it's hard to see how pitch is 
changing with time until you 
look at the spectrogram, until you look 
in the frequency domain. 
I want to point out that many other 
signals share the same frequency range as 
speech, and signal processing systems 
exploit in great detail this special 
structure of speech so that they can 
send speech signals efficiently. 
These are usually called vocoders, 
and that is short for voice coder. 
They typically take speech and, the way I 
like to put it, 
rip apart the speech signal into the 
pitch part and the vocal tract part. 
They actually try to split that signal up 
into those two pieces. 
That voice coder is embedded into cell 
phones, and in particular, they 
use a very efficient way of 
communicating what the vocal tract is, 
more so than just sending the raw 
speech signal, which is what you would 
normally do if you just had some regular 
six or seven kilohertz signal. 
Now it turns out vocoders have 
slipped into the music arena. 
And I'm sure if you've heard modern 
hip-hop, what they're doing is exactly 
what a vocoder does. 
They're ripping apart the signal into the 
vocal tract part and the pitch part. 
They're playing with the pitch part a 
little bit, 
distorting it from being a pure set of 
pulses, and then putting it through a 
filter that looks like the vocal tract 
they measured. 
And it turns out the music sounds, at 
least to my ears, weird, 
but that's how voice coders work. 
And by the way, modern computer 
technology and signal processing systems 
are such that you can do this in real 
time. 
You can actually do this while the singer 
is singing. It's kind of interesting. 
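Here is a hedged sketch of only the resynthesis half of that idea: keep the vocal tract filter fixed and drive it with a pulse train at a different pitch. The single two-pole resonator standing in for the measured vocal tract, its formant frequency and bandwidth, and both pitch values are assumptions for illustration. A real vocoder also has to estimate the pitch and the filter from recorded speech, which is not shown here.

```python
import math


def synthesize(pitch_hz, formant_hz, bandwidth_hz, duration_s, sample_rate_hz=8000):
    """Drive a two-pole resonator (stand-in for the vocal tract) with one impulse per pitch period."""
    period = round(sample_rate_hz / pitch_hz)                  # pitch period in samples
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)     # pole radius
    a1 = 2 * r * math.cos(2 * math.pi * formant_hz / sample_rate_hz)
    a2 = -r * r
    y1 = y2 = 0.0                                              # previous two outputs
    out = []
    for n in range(round(duration_s * sample_rate_hz)):
        x = 1.0 if n % period == 0 else 0.0                    # pitch pulse
        y = x + a1 * y1 + a2 * y2                              # resonator difference equation
        out.append(y)
        y1, y2 = y, y1
    return out


# Same hypothetical vocal tract filter, two different pitches.
original = synthesize(pitch_hz=100, formant_hz=500, bandwidth_hz=100, duration_s=0.25)
shifted = synthesize(pitch_hz=160, formant_hz=500, bandwidth_hz=100, duration_s=0.25)
```

Once the start-up transient has died away, `original` repeats every 80 samples and `shifted` every 50, while both keep the same resonance, so the "vowel" is unchanged and only the pitch has moved.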
But for our purposes what they're doing 
and what's important is they're 
exploiting this special structure of 
speech. 
And you will learn that the more you know 
about signals, 
the more effectively and efficiently you 
can communicate those kinds of signals.