In this video, we're going to talk about the speech signal and try to understand its structure. That structure is exploited in all kinds of systems, including modern telephone systems and cell phone communication systems, so it's really important to understand it. It's one thing to have a general characterization, like what bandwidth, what kind of frequencies the speech signal occupies. But there's more structure beyond that, and it's going to be very important in designing efficient communication systems for speech. All the aspects of the speech signal are determined by how it's generated, and we're going to develop a model for that, one that's based on linear signals and systems, things we already know. This is going to lead us to look at the spectral structure of speech. It turns out that in the time domain you can only gain so much information from examining a speech signal. But if you look at its spectrum, at its Fourier transform, then all of a sudden a lot of information pops out that's very, very useful in understanding what is going on and what is being said. So, how is speech generated? Here's my rather crude drawing of the speech production system. Everything begins with the lungs, which provide a source of air pressure, and the important part for us in understanding speech is the vocal cords. Seen from the top, the vocal cords look something like this, where these are tendon-like things and in between there's a slit that's open. When you breathe, the vocal cords are not under any tension. They're loose, the slit opens up, and you breathe normally. When you want to say something, what happens is that these come under tension. They are pulled, and that closes up this slit. All right?
And in fact they become completely closed, and it's not until the air pressure from the lungs builds up enough that it forces this slit to open momentarily, releasing a puff of air, which goes into the vocal tract. The vocal cords close. Air pressure builds up again, it is released rather quickly, giving another puff of air, and that's what I've tried to indicate there. So if you were to plot, as a function of time, the air pressure just above the vocal cords, you would get nothing for a while because the pressure below is building up, and finally it releases that puff of air and comes back down. Then the pressure builds up from the lungs again, and it pops up again and releases. Roughly, this produces a periodic pulse-like sequence from the vocal cords. Its period is known as the pitch period, and we're going to talk more about that in just a second. Now what happens is that these puffs of air go into what's called the vocal tract, which is formed by all of these structures: the tongue, the lips, the opening of your mouth, all kinds of things. If I were to draw this again, we might have something that looks like this. It opens up, and it kind of has some teeth here at the end. Those puffs of air are coming in there. Acoustically, what your mouth looks like is a pipe. If I were to straighten this out, it would look like a pipe whose cross-sectional area varies. So this is the length of the pipe and this is its cross section. Acoustically, your mouth looks like a straight pipe which has a complicated structure. What happens now is the air pressure signal comes in here and, the word is, it excites the air pressure inside the vocal tract. It turns out we can describe this as a linear filter having peaks that are called resonances, and this functions in many ways like an organ pipe. An organ pipe is a straight tube and has a very simple resonant structure.
Think of the way a note sounds on an organ. As you push the key, what happens? Air is forced into one of those pipes and it gives you this nice, not quite pure tone, but a resonance structure having harmonics of some fundamental. Because of the variable cross-sectional area here, it turns out the resonance structure becomes more complicated, and that is what we are going to see in the next slide. So let's summarize what the speech model is, at least in terms of a system-theoretic model. We have the lungs, and what they're producing is actually a pretty boring signal. It's basically a constant. The vocal cords will either be open, while you're breathing, or they will be put under tension when you're trying to speak, and that means there's some neural control coming from the brain that's controlling that. When you're speaking, which is the only case we care about, that produces what we're going to describe as a periodic pulse train, a periodic pulse sequence, something we've already talked about. That serves as the input to the vocal tract. According to what you want to say, the position of the tongue, the lips, everything else is again under neural control, and that determines what is being said. The result is the speech signal that you pick up with a microphone. Let's examine this in more detail to figure out important characteristics of these signals. So here's that model again, and I want to start with the input here, the periodic pitch signal, which is a periodic pulse train. We've already talked about this many times. It consists of a set of quite narrow pulses, and T is what's known as the pitch period. But what people normally refer to as the pitch is the pitch frequency, which is 1/T. In the speech world that's called F0, and that's the pitch. So as you change the tension of your vocal cords you can make your pitch go up, or go down if you loosen them, and that's all again under neural control.
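The relation between the pitch period T and the pitch frequency F0 = 1/T, and the pulse-train excitation itself, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pulse_train`, the 8 kHz sample rate, the 1 ms pulse width, and the 10 ms pitch period are all chosen for the example and are not taken from the lecture's slides.

```python
def pulse_train(pitch_hz, duration_s, sample_rate=8000, pulse_width_s=0.001):
    """Rectangular periodic pulse train: 1.0 during each pulse, 0.0 elsewhere.

    All parameter values are illustrative assumptions, not measured speech data.
    """
    period = round(sample_rate / pitch_hz)               # pitch period T, in samples
    width = max(1, round(pulse_width_s * sample_rate))   # pulse width, in samples
    n_samples = round(duration_s * sample_rate)
    return [1.0 if (n % period) < width else 0.0 for n in range(n_samples)]


T = 0.010        # assumed pitch period: 10 ms
F0 = 1.0 / T     # pitch frequency in hertz (100 Hz for a 10 ms period)
excitation = pulse_train(F0, duration_s=0.05)
```

Tightening the vocal cords corresponds to a smaller T, so the pulses come closer together and F0 goes up.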
Now, that signal serves as the input into what we're going to describe as a linear filter, which has a transfer function that will change according to the vowel or the speech sound you're trying to make. I have here two vowel sounds that I chose, the O and the E, and what I'm plotting here is the transfer function of the vocal tract. So an O sound has a series of peaks. These are the resonances I was referring to. And so does the E sound. It has a series of peaks, but you can see they have a very different structure, so let's look at that structure in more detail. In the speech world, each of these peaks is called a formant, and the formants are described by their frequencies, which are just numbered sequentially. The frequency of the peak at the lowest frequency is called F1, the next higher up is called F2, the next one is F3, then F4. So they're all just numbered sequentially according to the order they appear, from low to high, in the speech spectrum. The O has a moderately high F1, a very low F2, and F3 is shifted up, making a nice valley. And then there's F4 and F5. For the E sound, F1 is lower than it is for the O. F2 is all the way up here, so that resonance structure has really changed a lot. F3 is a little higher than it was for the O, and then F4 and F5 are roughly at the same place. So you can actually look at the spectra and figure out what vowel is being said. Now if you look in the time domain, things are not quite so clear. Here is a segment of speech corresponding to me saying O, and a segment of speech corresponding to me saying E. What you should recall is back when we talked about sending this periodic pulse sequence through our RC lowpass filter. What we got out was again something that was periodic, and it kind of looked like this. So we should expect something similar here. We have the same pulse sequence going into a linear filter.
It's a bit more complicated than a simple RC lowpass, but again we should see a periodic output. I used this example at the beginning of the course when we talked about signals, and the period should be readily evident. I'd like you to tell me what the period is for the vowel E in this case, okay? I get a period of roughly 10 milliseconds. This scale is in seconds, and the separation between these peaks is about 10 milliseconds, which makes the pitch 100 hertz. Now, another question for you. Let me clear the slide so we can see things. What is the pitch here, and in particular, is it higher or lower than it is for the E? When I talk about pitch, we usually worry about the frequency, not the interval. Is the pitch frequency lower or higher for the O? Okay? I think it's pretty evident that the pitch frequency is lower, because there's only one period there, and I see at least two over there, so those repeat in a shorter span of time. So the pitch is a little bit below 100 hertz. So, what should we expect the spectrum of the speech to look like? Well, as we know, the spectrum is equal to the transfer function times the Fourier transform, in this case the Fourier series, of the pitch signal. So you should expect the transfer function, H(f), which is one of these, to be multiplied by the Fourier series for this, which consists of a set of harmonics. They're all harmonics of F0. So our model for speech is a little bit simpler than what I've shown. We just worry about what the pitch signal is, basically what its period is, what its pitch frequency is. And then we worry about what the transfer function is, and that is how we describe the speech signal. Let me show you what an actual speech spectrum looks like, and now it makes a little bit more sense what we're seeing.
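The product just described, transfer function times the Fourier series of the pitch signal, can be sketched numerically. The Fourier coefficient magnitudes below are the standard ones for a rectangular pulse train of amplitude A, width Δ, and period T, namely |c_k| = A·|sin(πkΔ/T)|/(πk). Everything about the "vocal tract" is an assumption made for illustration: `resonance_mag` is a made-up single-peak shape, the peaks are simply summed, and the formant values are hypothetical, not measurements of any vowel.

```python
import math


def pulse_coeff_mag(k, width_s, period_s, amplitude=1.0):
    """|c_k| of a rectangular pulse train (amplitude A, width D, period T)."""
    if k == 0:
        return amplitude * width_s / period_s
    return amplitude * abs(math.sin(math.pi * k * width_s / period_s)) / (math.pi * abs(k))


def resonance_mag(f, center_hz, bandwidth_hz):
    """Illustrative single resonance peak (hypothetical shape, not a measured vocal tract)."""
    return 1.0 / math.sqrt(1.0 + ((f - center_hz) / (bandwidth_hz / 2.0)) ** 2)


def speech_line_spectrum(f0, n_harmonics, width_s, formants):
    """Return (frequency, amplitude) pairs: |H(k*f0)| * |c_k| at each harmonic of f0."""
    period = 1.0 / f0
    lines = []
    for k in range(1, n_harmonics + 1):
        f = k * f0
        h = sum(resonance_mag(f, fc, bw) for fc, bw in formants)  # crude stand-in for |H(f)|
        lines.append((f, h * pulse_coeff_mag(k, width_s, period)))
    return lines


# Assumed values: 100 Hz pitch, 1 ms pulses, two made-up formants (center Hz, bandwidth Hz).
lines = speech_line_spectrum(100.0, 50, 0.001, [(500.0, 100.0), (1500.0, 150.0)])
```

The result is a set of lines at F0, 2F0, 3F0, and so on, whose heights follow the resonance peaks but are also pulled down by the decaying pulse-train coefficients.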
So you can see these peaks. The spectrum of our periodic pulse train is going to be a set of lines, decaying like a sinc function, at the harmonics of F0. So that would be at F0, 2F0, 3F0, et cetera, and those correspond to each of these peaks. They go on forever, but their amplitudes generally go down. In some places it gets rather hard to see that peak structure, and that's because the energy is getting rather low. Furthermore, it departs from the shape of the transfer function a little bit, and that's because the sinc-function character of the Fourier coefficients is reducing it. You can see that the amplitude of these pitch lines is multiplied by something that does look like the transfer function for the letter O, and so now we can see what's going on. We should be able to look at a spectrum, see what the pitch is, and get a rough idea of what the formant structure is, at least saying generally where F1 is. Is it high or low? Is F2 really high or really low? I think you can see that fairly easily from this plot. That is why, in the frequency domain, the structure of speech is much more apparent than it is in the time domain. All right. So, here is what we call the speech spectrogram. This is a special plot that puts frequency on the vertical axis, and what it's plotting at each moment in time is the spectrum, as a heat map. So there's a spectrum corresponding to each column of the image here, and the amplitude of the spectrum is encoded as a color. The very deep red corresponds to the highest amplitudes. The yellowish ones are smaller than that. The greens are even smaller than that, and the blues are the smallest. Essentially the temperature reflects amplitude, and that's why it's called a heat map, of course. All right, so what we see below is the time-domain plot of the speech signal. And that's me saying Rice University.
And there are some segments where you can roughly make out the periodic nature of this speech signal. The scale here is highly compressed, going over 1.2 seconds, so it is very hard to see. What's much clearer is what's going on in the frequency domain. It makes everything much clearer to me. The first thing I want to point out are these lines going across here, and we now know those are the pitch lines. They correspond to the pitch; they're harmonics of the pitch. Over here, it looks to me like the pitch is going up, all right, because the lines are moving up to higher frequencies. So that's me saying "Rice." Here it looks like they're going down a little bit. Over here, they're relatively constant. So you can get an idea of how this speaker is saying the words by looking at those pitch contours. Now, what's more important to understanding what is being said, at least in English, are the formants. So I see here, that's probably F1. Here is F2. F1 is roughly constant, F2 is moving up, and this could be F3, and then over here it could be F4. Here you can see F2 really moving up and F1 maybe moving down a little bit, and here's F2 really moving up and F1 moving down, when I'm saying "T." So we can get a very detailed idea of what's going on in this speech signal. In fact, there are experts out there who, if you give them a spectrogram and tell them what language is being spoken, can tell you what's being said and whether it's a male or a female saying it. And because I know it's me, I have some idea of scale, and I can tell you the pitch is much more indicative of a male than a female. Females have a higher range of pitch frequencies than males. This one is so low, about 100 hertz, that it probably is a male. If it goes much higher, higher than about 120 hertz, then that's probably a female, and children are even higher.
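The spectrogram described above, one spectrum per column with time running left to right, can be sketched with nothing more than a windowed DFT on successive frames. This is a minimal sketch under assumptions: the function name `spectrogram`, the Hann window, the frame length, and the hop size are choices made for the example, and the test signal is a plain 1 kHz tone rather than recorded speech. A real plot would then map each magnitude to a color.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop):
    """Return a list of columns; each column holds DFT magnitudes for bins 0..frame_len//2.

    Bin k corresponds to frequency k * sample_rate / frame_len.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        column = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            column.append(abs(acc))
        columns.append(column)
    return columns


# Assumed test signal: a steady 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
columns = spectrogram(tone, frame_len=64, hop=32)
peak_bins = [col.index(max(col)) for col in columns]  # strongest bin in each column
```

With a 64-sample frame at 8 kHz the bins are 125 Hz apart, so a steady tone shows up as the same strongest bin in every column, which is a horizontal line in the image. Pitch harmonics that rise or fall over time would instead trace sloped lines.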
Now, I want to point out some things, and that is, there are some aspects of this that don't fit the speech model and don't have pitch lines, and that's corresponding to these areas. That's when you're saying [SOUND], [SOUND], and that corresponds to what we call fricatives. These are sounds in which the vocal cords are open, the air pressure from the lungs just hits the vocal tract, and then by closing the teeth or the lips, you create essentially a turbulent noise. It doesn't have much structure in the time domain, but in the frequency domain it occurs at high frequencies. In this plot it's occurring higher than any of the formants, and that is pretty typical. It's very interesting that in the old days, in the days of analog telephones, they would lowpass filter the speech signal with a cutoff frequency of about 3,300 hertz, cutting out everything above that. Now it turns out that is high enough that it doesn't really affect the formant structure all that much, at least in terms of understanding what's being said, but for the fricative signals it definitely makes things very hard. I know in the old days it was very hard to tell apart words that differ only in their fricative. "Fix" and "six" were incredibly difficult to tell apart over old telephone systems. In fact, I suggest you try it with a friend. Call them up and say these words in isolation, and see if your friend can understand what you're saying. Now, of course, if you put these in a sentence you can easily tell them apart. You can say, "I fixed my car." You certainly wouldn't say, "I sixed my car." It doesn't make any sense. So say them out of context, in isolation, one word at a time. I think you're going to find that in some cases you have a hard time telling them apart. If your phone system has a much higher cutoff frequency, then those get through.
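The effect of that 3,300 hertz telephone cutoff can be illustrated by measuring how much of a signal's energy lies below the cutoff, which is what an ideal lowpass filter would keep. This is a rough sketch under assumptions: the helper names are hypothetical, the 16 kHz sample rate is chosen for the example, and pure tones at 500 Hz and 5,000 Hz stand in for formant-region and fricative-region energy. Real fricatives are noise-like, not tones.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def band_energy(x, sample_rate, lo_hz, hi_hz):
    """Energy of x in [lo_hz, hi_hz), summed over the non-negative-frequency DFT bins."""
    spectrum = dft(x)
    n = len(x)
    return sum(abs(spectrum[k]) ** 2 for k in range(n // 2 + 1)
               if lo_hz <= k * sample_rate / n < hi_hz)


def fraction_kept(x, sample_rate, cutoff_hz):
    """Fraction of energy an ideal lowpass filter with this cutoff would keep."""
    total = band_energy(x, sample_rate, 0.0, sample_rate / 2 + 1)
    return band_energy(x, sample_rate, 0.0, cutoff_hz) / total


fs = 16000  # assumed sample rate
low_tone = [math.sin(2 * math.pi * 500 * n / fs) for n in range(160)]    # formant-region stand-in
high_tone = [math.sin(2 * math.pi * 5000 * n / fs) for n in range(160)]  # fricative-region stand-in

kept_low = fraction_kept(low_tone, fs, 3300.0)
kept_high = fraction_kept(high_tone, fs, 3300.0)
```

The low-frequency stand-in passes essentially untouched while the high-frequency one is removed, which is the sense in which the old telephone filter preserved the formants but lost the fricatives.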
So you can judge what's going on in the phone system by playing a little word recognition game. Now, there's something else I want to point out here, and that is that the highest frequency in this plot is about 5,500 hertz. We'll go into why that is when we talk about digital signal processing. It turns out that some speech signals can have higher-frequency content, and we usually say that the bandwidth is about 6-7 kHz. It's very approximate, because as you are seeing, the transfer function for the vowels generally goes down as frequency goes up. Beyond about 4-5 kHz it's definitely going down, so the bandwidth is only a rough measure of what's going on. So we can now see a lot of things about the structure of speech. Now I claim, if you look at the spectrogram, there's something not speech-like in it. There's something here that isn't right. In particular, I want to point out this line that's going across here. What in the world does that correspond to? There are two questions: what kind of signal would produce that line, and what's physically producing that line? Okay, in terms of signals, it turns out it looks like a sine wave with a frequency of, oh, 1,800 hertz. Well, that's pretty high. It's certainly not power frequency, and it turns out that's fan noise from my computer. I couldn't get my computer to turn its fan off when I made this recording, and you're picking that up. You can even see there's another piece of it up there at a higher frequency. Okay. So we now know what the structure of speech is. First of all, it has a bandwidth of about 6 or 7 kilohertz, and that's important in many applications. The structure of speech is best seen in the frequency domain. There are some time-domain things you can pick out, like what the pitch period is, and that can be pretty easy.
But it's hard to see how pitch is changing with time until you look at the spectrogram, until you look in the frequency domain. I want to point out that many other signals share the same frequency range as speech, and signal processing systems exploit in great detail this special structure of speech so that they can send speech signals efficiently. These are usually called vocoders, which is short for voice coder. The way I like to put it, they rip apart the speech signal into the pitch part and the vocal tract part. They actually try to split that signal up into those two pieces. That voice coder is embedded into cell phones, and in particular, they use a very efficient way of communicating what the vocal tract is, more so than just sending the raw speech signal, which is what you would normally do if you just had some regular six or seven kilohertz signal. Now it turns out vocoders have slipped into the music arena. I'm sure if you've heard modern hip-hop, what they're doing is exactly what a vocoder does. They're ripping apart the signal into the vocal tract part and the pitch part. They're playing with the pitch part a little bit, distorting it from being a pure set of pulses, and then putting it through a filter that looks like the vocal tract they measured. And it turns out the music sounds, at least to my ears, weird. But that's how voice coders work. And by the way, modern computer technology and signal processing systems are such that you can do this in real time. You can actually do this while the singer is singing. It's kind of interesting. But for our purposes, what's important is that they're exploiting this special structure of speech. And you will learn that the more you know about signals, the more effectively and efficiently you can communicate those kinds of signals.
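The "pitch part plus vocal tract part" idea can be sketched from the synthesis side: a pulse train at the chosen pitch is passed through a filter with resonances at the chosen formant frequencies. This is only an illustration of the source-filter model under assumptions, not how any particular vocoder is implemented. The two-pole resonator, the function names, the 8 kHz sample rate, and the formant values are all hypothetical choices for the example, and the analysis step that a real vocoder uses to measure pitch and vocal tract is not shown.

```python
import math


def resonator(x, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = x[n] + 2 r cos(theta) y[n-1] - r^2 y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius sets the peak width
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle sets the peak frequency
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y


def synthesize(pitch_hz, formants, duration_s, sample_rate=8000):
    """Pulse train at the given pitch, filtered by a cascade of formant resonators."""
    period = round(sample_rate / pitch_hz)
    n = round(duration_s * sample_rate)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]  # the pitch part
    for center_hz, bandwidth_hz in formants:                      # the vocal tract part
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed values: 100 Hz pitch and two made-up formants given as (center Hz, bandwidth Hz).
voiced = synthesize(100, [(500, 80), (1500, 120)], duration_s=0.1)
```

Changing `pitch_hz` while keeping `formants` fixed changes the pitch without changing which resonances shape the sound, which is the kind of manipulation described above for music.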