1 00:00:00,012 --> 00:00:04,937 In this video, we're going to talk about the speech signal and try to understand 2 00:00:04,937 --> 00:00:08,477 its structure. That structure is exploited in all kinds 3 00:00:08,477 --> 00:00:13,372 of systems, including modern telephone systems and cell phone communication 4 00:00:13,372 --> 00:00:16,354 systems. It's really important to understand the 5 00:00:16,354 --> 00:00:18,847 structure. It's one thing to have a general 6 00:00:18,847 --> 00:00:23,519 characterization, like what bandwidth, what kind of frequencies the speech 7 00:00:23,519 --> 00:00:26,730 signal occupies. But there's more structure beyond that, 8 00:00:26,730 --> 00:00:30,886 that's going to be very important in designing efficient communications 9 00:00:30,886 --> 00:00:35,592 systems for speech. So all the aspects of the speech 10 00:00:35,592 --> 00:00:39,132 signal are determined by how it's generated. 11 00:00:39,132 --> 00:00:44,862 And we're going to develop a model for that, that's based on linear signals and 12 00:00:44,862 --> 00:00:47,417 systems, things we already know. 13 00:00:47,417 --> 00:00:52,872 And this is going to lead us to look at the spectral structure of speech. 14 00:00:52,872 --> 00:00:58,372 Turns out in the time domain you can only gain so much information from examining a 15 00:00:58,372 --> 00:01:02,147 speech signal. But if you look at its spectrum, you look 16 00:01:02,147 --> 00:01:06,847 at its Fourier transform, then all of a sudden a lot of information pops out 17 00:01:06,847 --> 00:01:12,372 that's very, very useful in understanding what is going on and what is being said. 18 00:01:12,372 --> 00:01:16,912 So, how is speech generated? So here's my rather crude drawing 19 00:01:16,912 --> 00:01:25,232 of the speech production system.
Everything begins with the lungs, and the 20 00:01:25,232 --> 00:01:33,372 lungs are providing a source of air pressure, and the important part for us 21 00:01:33,372 --> 00:01:37,797 in understanding speech is the vocal cords. 22 00:01:37,797 --> 00:01:44,602 Now, seen from the top, the vocal cords, the structure looks 23 00:01:44,602 --> 00:01:50,647 something like this, where these are tendon-like things. 24 00:01:50,647 --> 00:01:54,842 And in between, there's a slit that's open. 25 00:01:54,842 --> 00:02:02,297 And when you breathe, the vocal cords, this part, is not under any tension, 26 00:02:02,297 --> 00:02:07,977 they're loose, the slit opens up, and you breathe normally. 27 00:02:07,977 --> 00:02:16,299 When you want to say something, what happens is that these come under 28 00:02:16,299 --> 00:02:20,563 tension. They are pulled. 29 00:02:20,563 --> 00:02:28,981 And that closes up this slit. All right? And in fact they become 30 00:02:28,981 --> 00:02:36,115 completely closed, and it's not until the air pressure from the lungs builds up 31 00:02:36,115 --> 00:02:43,062 enough that it forces this slit to open momentarily, releasing a puff of air, 32 00:02:43,062 --> 00:02:48,822 which goes into the vocal tract. The vocal cords close. 33 00:02:48,822 --> 00:02:54,197 Air pressure builds up again, it's released rather quickly, giving another puff of 34 00:02:54,197 --> 00:02:57,622 air, and that's what I've tried to indicate there. 35 00:02:57,622 --> 00:03:02,647 So if you were to plot, as a function of time, the air pressure just above the 36 00:03:02,647 --> 00:03:07,747 vocal cords, you would get nothing for a while because the pressure below is 37 00:03:07,747 --> 00:03:12,847 building up, and finally it releases, releasing that puff of air, and it comes 38 00:03:12,847 --> 00:03:17,560 back down. And then the pressure builds up from the 39 00:03:17,560 --> 00:03:22,100 lungs again, and it pops up again, and releases.
40 00:03:22,100 --> 00:03:29,462 And roughly, it produces a periodic pulse-like sequence from the vocal cords. 41 00:03:29,462 --> 00:03:34,559 And this period is known as the pitch period. 42 00:03:34,559 --> 00:03:40,623 And we're going to talk more about that in just a second. 43 00:03:40,623 --> 00:03:46,332 Well, now what happens is that these puffs of air 44 00:03:46,332 --> 00:03:53,528 go into what's called the vocal tract, which is formed by all of these 45 00:03:53,528 --> 00:04:00,971 structures: the tongue, the lips, the opening of your mouth, all kinds of 46 00:04:00,971 --> 00:04:05,403 things. And if I were to draw this again, we 47 00:04:05,403 --> 00:04:09,599 might have something that looks like this. 48 00:04:09,599 --> 00:04:15,167 Opens up, kind of has some teeth here at the end. 49 00:04:15,167 --> 00:04:21,714 And what's happening, those puffs of air are coming in there. 50 00:04:21,714 --> 00:04:28,189 Acoustically, what your mouth looks like is a pipe. 51 00:04:28,189 --> 00:04:36,712 If I was to straighten this out, it would look like a pipe that has a 52 00:04:36,712 --> 00:04:44,862 cross-sectional area that's varying. So, this is the length of the pipe and 53 00:04:44,862 --> 00:04:52,582 this is its cross section. And acoustically, your mouth looks like 54 00:04:52,582 --> 00:04:57,888 a straight pipe which has a complicated structure. 55 00:04:57,888 --> 00:05:05,207 Well, what happens now is the air pressure signal comes in here and it, the 56 00:05:05,207 --> 00:05:10,970 word is, it excites the air pressure inside the vocal tract. 57 00:05:10,970 --> 00:05:16,712 Well, it turns out we can describe this as a linear filter having peaks 58 00:05:16,712 --> 00:05:22,597 that are called resonances, and this functions in many ways like an organ 59 00:05:22,597 --> 00:05:26,089 pipe. An organ pipe is a straight tube, has a very 60 00:05:26,089 --> 00:05:31,592 simple resonant structure.
The way a note sounds on an organ: 61 00:05:31,592 --> 00:05:36,692 as you push the key, what happens? Air is forced into one of those pipes and it 62 00:05:36,692 --> 00:05:42,492 gives you this nice, not quite pure tone, but a resonance structure having harmonics 63 00:05:42,492 --> 00:05:46,642 of some fundamental. Because of the variable cross-sectional 64 00:05:46,642 --> 00:05:51,642 area here, it turns out the resonance structure becomes more complicated. 65 00:05:51,642 --> 00:05:55,217 And that is what we are going to see in the next slide. 66 00:05:55,217 --> 00:06:00,747 So let's summarize what the speech model is, at least in terms of a system 67 00:06:00,747 --> 00:06:04,132 theoretic model. So, we have the lungs. 68 00:06:04,132 --> 00:06:09,442 And what they're producing is actually a pretty boring signal. 69 00:06:09,442 --> 00:06:15,182 It's basically a constant. The vocal cords will either be open while 70 00:06:15,182 --> 00:06:19,966 you're breathing, or they will be put under tension when 71 00:06:19,966 --> 00:06:25,101 you're trying to speak, and that means there's some neural control coming from 72 00:06:25,101 --> 00:06:30,237 the brain that's controlling that. From the point of view of speech, which 73 00:06:30,237 --> 00:06:35,427 is the only thing we care about, that produces what we're going to describe as 74 00:06:35,427 --> 00:06:39,117 a periodic pulse train, a periodic pulse sequence, 75 00:06:39,117 --> 00:06:44,507 something we've already talked about. That serves as the input to the vocal 76 00:06:44,507 --> 00:06:48,087 tract. According to what you want to say, 77 00:06:48,087 --> 00:06:53,687 the position of the tongue, the lips, everything else is again under neural 78 00:06:53,687 --> 00:06:57,362 control, and that determines what is being said. 79 00:06:57,362 --> 00:07:02,783 And the result is the speech signal that you pick up with a microphone.
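The model summarized above (a periodic pulse train feeding a linear filter with resonances) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecturer's implementation: the sampling rate, the two resonance frequencies and bandwidths, and the helper names `pulse_train` and `resonator` are all made up for the example, and a two-pole filter is just one simple way to get a single peak in a transfer function.

```python
import math

def pulse_train(n_samples, period):
    """Unit pulses every `period` samples: a crude stand-in for the puffs of air."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole linear filter with one resonance (one peak in its transfer function)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)             # pole radius sets the peak width
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)  # pole angle sets the peak frequency
    a2 = -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y

fs = 8000      # assumed sampling rate in Hz
period = 80    # 80 samples at 8 kHz = 10 ms pitch period, i.e. 100 Hz
source = pulse_train(800, period)

# Two hypothetical resonances in cascade play the role of the vocal tract.
speech_like = resonator(resonator(source, 500.0, 80.0, fs), 1500.0, 100.0, fs)
```

Because the filter is linear and the input is periodic, `speech_like` settles into a waveform that repeats every 80 samples; changing the resonance frequencies changes the shape of each period but not the period itself.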
80 00:07:02,783 --> 00:07:09,422 Well, let's examine this in more detail to figure out important characteristics of 81 00:07:09,422 --> 00:07:13,525 these signals. So here's that model again, and I want to 82 00:07:13,525 --> 00:07:19,048 start with the input here, the periodic pitch signal, 83 00:07:19,048 --> 00:07:25,567 which is a periodic pulse train. We've already talked about this many times. 84 00:07:25,567 --> 00:07:32,040 So it consists of a set of pulses, quite narrow, and T is what's known as the 85 00:07:32,040 --> 00:07:36,635 pitch period. But what people normally refer to as the 86 00:07:36,635 --> 00:07:44,091 pitch is the pitch frequency, which is 1/T, and in the speech world that's called F0, 87 00:07:44,091 --> 00:07:50,244 and that's the pitch. So as you change the tension of your 88 00:07:50,244 --> 00:07:57,489 vocal cords, you can make your pitch go up, or go down if you loosen it, 89 00:07:57,489 --> 00:08:02,463 and that's all again under neural control. 90 00:08:02,463 --> 00:08:09,150 Now, that signal serves as the input into what we're going to describe as a 91 00:08:09,150 --> 00:08:14,053 linear filter, which has a transfer function which will 92 00:08:14,053 --> 00:08:20,343 change according to the vowel or the speech sound you're trying to make. 93 00:08:20,343 --> 00:08:25,717 So I have here two vowel sounds that 94 00:08:25,717 --> 00:08:32,912 I chose, the O and the E, and what I'm plotting here is the transfer 95 00:08:32,912 --> 00:08:39,410 function of the vocal tract. So an O sound has a series of peaks; 96 00:08:39,410 --> 00:08:43,882 these are the resonances I was referring to. 97 00:08:43,882 --> 00:08:49,587 And so does the E sound. It has a series of peaks, but you can 98 00:08:49,587 --> 00:08:56,392 see they have a very different structure, and let's look at that 99 00:08:56,392 --> 00:09:02,792 structure in more detail.
In the speech world, each of these peaks 100 00:09:02,792 --> 00:09:08,942 is called a formant, and the formants are determined by their 101 00:09:08,942 --> 00:09:12,427 frequencies, which are just numbered sequentially. 102 00:09:12,427 --> 00:09:17,695 So the frequency of the peak at the lowest frequency 103 00:09:17,695 --> 00:09:22,673 is called F1, the next higher up is called F2, the next one's F3, then 104 00:09:22,673 --> 00:09:25,501 F4. So they're all just numbered sequentially 105 00:09:25,501 --> 00:09:31,196 according to the order they appear from low to high in the 106 00:09:31,196 --> 00:09:37,300 speech spectrum. So the O has a moderately high F1, a 107 00:09:37,300 --> 00:09:42,759 very low F2, F3 is shifted up making a nice valley. 108 00:09:42,759 --> 00:09:49,732 And then there's F4 and F5. For the E sound, F1 is lower than it is 109 00:09:49,732 --> 00:09:53,502 for the O. F2 is all the way up here, so that 110 00:09:53,502 --> 00:09:57,212 resonance structure has really changed a lot. 111 00:09:57,212 --> 00:10:03,292 F3 is a little higher than it was for the O, and then F4 and F5 are roughly at the 112 00:10:03,292 --> 00:10:04,722 same place. So 113 00:10:04,722 --> 00:10:12,291 you can actually look at the spectra and figure out what the vowel is that's being 114 00:10:12,291 --> 00:10:16,097 said. Now if you look in the time domain, 115 00:10:16,097 --> 00:10:23,131 things are not quite so clear. Here is a segment of speech corresponding 116 00:10:23,131 --> 00:10:29,368 to me saying O, and a segment of speech corresponding to me saying E. 117 00:10:29,368 --> 00:10:35,932 Now what you should recall is back when we talked about sending this periodic 118 00:10:35,932 --> 00:10:39,770 pulse sequence through our RC lowpass filter. 119 00:10:39,770 --> 00:10:46,222 What we got out was again something that was periodic, kind of looked like this.
120 00:10:46,222 --> 00:10:50,287 And so we should expect something similar here. 121 00:10:50,287 --> 00:10:55,291 We have the same pulse sequence going into a linear filter. 122 00:10:55,291 --> 00:11:02,198 It's a bit more complicated than a simple RC lowpass, but again we should see a 123 00:11:02,198 --> 00:11:09,330 periodic output. I used this example at the beginning of the course when we 124 00:11:09,330 --> 00:11:17,102 talked about signals, and the period should be readily 125 00:11:17,102 --> 00:11:21,316 evident. I'd like you to tell me what the period 126 00:11:21,316 --> 00:11:26,692 is for the vowel E in this case, okay? I get a period of 127 00:11:26,692 --> 00:11:31,930 roughly 10 milliseconds; this scale is in seconds. 128 00:11:31,930 --> 00:11:40,450 The separation between these peaks is about 10 milliseconds, which makes the pitch 129 00:11:40,450 --> 00:11:41,806 100 hertz. 130 00:11:41,806 --> 00:11:49,089 Now, another question for you. Let me clear the slide so we can see 131 00:11:49,089 --> 00:11:52,852 things. What is the pitch here, and in 132 00:11:52,852 --> 00:12:00,487 particular, is it higher or lower than it is for the E? And when I talk about 133 00:12:00,487 --> 00:12:08,644 pitch, we usually worry about the frequency, not the interval, but the 134 00:12:08,644 --> 00:12:14,111 frequency. Is the pitch frequency lower or higher 135 00:12:14,111 --> 00:12:22,828 for the O? Okay? I think it's pretty evident that the pitch frequency is 136 00:12:22,828 --> 00:12:27,677 lower, because there's only 1 137 00:12:27,677 --> 00:12:34,790 period there, and I see at least 2 there in a shorter period of time. 138 00:12:34,790 --> 00:12:39,867 So the pitch is lower, a little bit below 100 hertz. 139 00:12:39,867 --> 00:12:47,412 So, what should we expect the spectrum of the speech to look like?
Well, as we 140 00:12:47,412 --> 00:12:56,043 know, the spectrum is equal to the transfer function times the Fourier 141 00:12:56,043 --> 00:13:02,533 transform, in this case the Fourier series, of the 142 00:13:02,533 --> 00:13:08,172 pitch signal. So you should expect the transfer 143 00:13:08,172 --> 00:13:15,012 function, H(f), which is one of these, to be multiplied by the Fourier series 144 00:13:15,012 --> 00:13:18,992 for this, which consists of a set of harmonics. 145 00:13:18,992 --> 00:13:22,757 The harmonics, they're all harmonics of F0. 146 00:13:22,757 --> 00:13:28,717 So our model for speech is a little bit simpler than what I've shown. 147 00:13:28,717 --> 00:13:34,392 We just worry about what the pitch signal is, basically 148 00:13:34,392 --> 00:13:38,657 what its period is, what its pitch frequency is. 149 00:13:38,657 --> 00:13:44,902 And then we worry about what the transfer function is, and that is how we 150 00:13:44,902 --> 00:13:52,707 describe the speech signal. And let me show you what an actual speech 151 00:13:52,707 --> 00:14:02,402 spectrum looks like, and now it makes a little bit more sense what we're seeing. 152 00:14:02,402 --> 00:14:11,605 So you can see these peaks. The spectrum of our periodic pulse train is 153 00:14:11,605 --> 00:14:17,960 going to be a set of lines, decaying like a sinc function, 154 00:14:17,960 --> 00:14:23,452 at the harmonics of F0. So that would be at F0, 155 00:14:23,452 --> 00:14:28,304 2F0, 3F0, et cetera. And then it corresponds to each of these 156 00:14:28,304 --> 00:14:31,242 peaks. And they will go up forever, but 157 00:14:31,242 --> 00:14:35,683 generally go down. In some places it gets rather hard to see 158 00:14:35,683 --> 00:14:41,132 that peak structure, and that's because the energy is getting rather low.
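To make the "transfer function times Fourier series" statement concrete, here is a small sketch of the line spectrum it predicts. It assumes rectangular pulses, for which the Fourier series coefficients have the sinc shape mentioned above; the 1 ms pulse width and the single-resonance `toy_vocal_tract` magnitude are invented for illustration and are not measurements of a real vowel.

```python
import math

def pulse_train_coefficient(k, width, period, amplitude=1.0):
    """Magnitude of the k-th Fourier series coefficient of a rectangular pulse train."""
    duty = width / period
    x = k * duty
    sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
    return abs(amplitude * duty * sinc)

def toy_vocal_tract(f):
    """Hypothetical |H(f)| with one resonance near 500 Hz (not a measured vowel)."""
    return 1.0 / math.sqrt(1.0 + ((f - 500.0) / 100.0) ** 2)

f0 = 100.0           # pitch frequency in Hz, so T = 10 ms
period = 1.0 / f0
width = 0.001        # assumed pulse width: 1 ms

# One spectral line per harmonic k*F0: the filter gain times the pulse-train coefficient.
lines = [(k * f0, toy_vocal_tract(k * f0) * pulse_train_coefficient(k, width, period))
         for k in range(1, 21)]
strongest = max(lines, key=lambda item: item[1])
```

With these numbers the strongest line sits at the harmonic nearest the resonance, and the sinc factor pulls the higher harmonics down relative to the shape of the transfer function alone.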
159 00:14:41,132 --> 00:14:47,630 And furthermore, that departs from the shape of the transfer function a little 160 00:14:47,630 --> 00:14:53,887 bit, and that's because the sinc function character of the Fourier coefficients is 161 00:14:53,887 --> 00:14:58,272 reducing it. And so you can see that the amplitude of 162 00:14:58,272 --> 00:15:03,987 these pitch lines is multiplied by something that does look like the 163 00:15:03,987 --> 00:15:08,932 transfer function for the letter 'O', and so now we can see 164 00:15:08,932 --> 00:15:13,595 what's going on. We should be able to look at a spectrum 165 00:15:13,595 --> 00:15:19,327 and see what the pitch is, and get a rough idea of what the formant structure 166 00:15:19,327 --> 00:15:22,796 is, at least saying generally where's F1, is 167 00:15:22,796 --> 00:15:27,357 it high or low? Is F2 really high or really low? 168 00:15:27,357 --> 00:15:32,382 And I think you can see that fairly easily from this plot. 169 00:15:32,382 --> 00:15:39,007 That is why in the frequency domain the structure of speech is much 170 00:15:39,007 --> 00:15:42,932 more apparent than it is in the time domain. 171 00:15:42,932 --> 00:15:47,282 All right. So, here is what we call the speech 172 00:15:47,282 --> 00:15:51,752 spectrogram. And this is a special plot that plots 173 00:15:51,752 --> 00:16:00,122 frequency on the vertical axis, and what it's plotting at each moment in time is 174 00:16:00,122 --> 00:16:07,360 the spectrum as a heat map. So, there's a spectrum corresponding to 175 00:16:07,360 --> 00:16:15,632 each column of the image here, and the amplitude of the spectrum is encoded as a 176 00:16:15,632 --> 00:16:19,597 color. So the very deep red corresponds to the 177 00:16:19,597 --> 00:16:24,577 highest amplitudes. The yellowish are smaller than that. 178 00:16:24,577 --> 00:16:30,827 The greens are even smaller than that, and the blues are the smallest.
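As a rough sketch of how such a plot can be computed, the code below slides a short window along the signal and takes the magnitude of a DFT for each position, giving one column of the image per frame. The frame length, hop size, Hann window, and the two-tone test signal are choices made for this example; the lecture does not say how its spectrogram was produced.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of the DFT of one frame, for bins 0 .. N/2."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len, hop):
    """One spectrum (one column of the heat map) for each frame start."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        columns.append(dft_magnitudes(frame))
    return columns

fs = 8000  # assumed sampling rate in Hz
# Test signal: a 1000 Hz tone for the first 512 samples, then a 2000 Hz tone.
tone = [math.sin(2 * math.pi * (1000 if n < 512 else 2000) * n / fs) for n in range(1024)]

cols = spectrogram(tone, frame_len=128, hop=64)
# Index of the largest bin in each column; bin spacing is fs / frame_len = 62.5 Hz.
peak_bins = [max(range(len(c)), key=c.__getitem__) for c in cols]
```

Mapping each magnitude in `cols` to a color and stacking the columns left to right gives the heat map; here the bright bin moves from 1000 Hz to 2000 Hz partway through.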
179 00:16:30,827 --> 00:16:37,387 Essentially the temperature reflects amplitude, and that's why it's called a 180 00:16:37,387 --> 00:16:40,232 heat map, of course. All right, so 181 00:16:40,232 --> 00:16:46,976 what we see below is the time-domain plot of the speech signal. 182 00:16:46,976 --> 00:16:53,929 And that's me saying 'Rice University'. And then there are some segments where 183 00:16:53,929 --> 00:16:59,630 you can roughly make out the periodic nature of this speech signal. 184 00:16:59,630 --> 00:17:05,139 The scale here is highly compressed, going over 1.2 seconds, 185 00:17:05,139 --> 00:17:11,476 so it is very hard to see. What's much clearer is what's going on in 186 00:17:11,476 --> 00:17:17,001 the frequency domain. Makes everything much clearer to me. 187 00:17:17,001 --> 00:17:22,042 So the first thing I want to point out are these lines 188 00:17:22,042 --> 00:17:27,735 across here, and we now know those are the pitch lines. 189 00:17:27,735 --> 00:17:34,501 That corresponds to the pitch. They're harmonics of the pitch. 190 00:17:34,501 --> 00:17:38,629 Over here, looks to me like the pitch is going up. 191 00:17:38,629 --> 00:17:44,199 All right, 'cause the lines are moving up to higher frequencies. 192 00:17:44,199 --> 00:17:49,931 So that's me saying 'Rice', 'Rice'. Here it looks like they're going down a 193 00:17:49,931 --> 00:17:54,096 little bit. Over here, they're relatively constant. 194 00:17:54,096 --> 00:17:58,882 So you can get an idea of how this speaker is saying the words 195 00:17:58,882 --> 00:18:05,115 by looking at those pitch contours. Now, what's more important to 196 00:18:05,115 --> 00:18:12,080 understanding, at least in English, what is being said are the formants. 197 00:18:12,080 --> 00:18:16,621 So, I see here, that's probably F1. Here is F2.
198 00:18:16,621 --> 00:18:23,408 So F1 is roughly constant, F2 is moving up, and this could be F3, and then over 199 00:18:23,408 --> 00:18:28,741 here, it could be F4. Here you can see F2 really moving up, 200 00:18:28,741 --> 00:18:35,378 F1 maybe moving down a little bit, and here's F2 really moving up and F1 moving 201 00:18:35,378 --> 00:18:40,722 down, when I'm saying T. And so we can get a very detailed idea of 202 00:18:40,722 --> 00:18:46,972 what's going on in this speech signal. In fact, there are experts out there that, 203 00:18:46,972 --> 00:18:53,042 if you give them a spectrogram and tell them what language is being 204 00:18:53,042 --> 00:18:58,932 spoken, can tell you what's being said and whether it's a male or 205 00:18:58,932 --> 00:19:04,589 female that's saying it. Yet because I know it's me, I have some 206 00:19:04,589 --> 00:19:10,545 idea of scale, and I can tell you the pitch is much more indicative of a male 207 00:19:10,545 --> 00:19:14,785 than a female. Females have a higher range of pitch 208 00:19:14,785 --> 00:19:19,490 frequencies than males. This one is so low, about 100 hertz, 209 00:19:19,490 --> 00:19:24,753 that that probably is a male. If it goes much higher, higher than about 210 00:19:24,753 --> 00:19:29,941 120 hertz, then that's probably a female, and children are even higher. 211 00:19:29,941 --> 00:19:35,771 Now, I want to point out some things, and that is, there are some aspects of this 212 00:19:35,771 --> 00:19:43,375 that don't fit the speech model and don't have pitch lines, and that's 213 00:19:43,375 --> 00:19:50,695 corresponding to these areas. And that's when you're saying [SOUND] 214 00:19:50,695 --> 00:19:56,703 [SOUND], and that corresponds to what we call fricatives.
215 00:19:56,703 --> 00:20:03,282 And these are words in which the vocal cords open, the 216 00:20:03,282 --> 00:20:09,653 air pressure from the lungs just hits the vocal tract, and then, by closing 217 00:20:09,653 --> 00:20:14,477 the teeth or the lips, you create essentially a turbulent noise. 218 00:20:14,477 --> 00:20:20,274 It doesn't have much structure in the time domain, but in the frequency domain 219 00:20:20,274 --> 00:20:29,042 it occurs at high frequencies. In this plot it is occurring higher than any of 220 00:20:29,042 --> 00:20:34,129 the formants. And that is pretty typical. 221 00:20:34,129 --> 00:20:43,972 It's very interesting that in the old days, in the days of analog telephones, 222 00:20:43,972 --> 00:20:51,886 they would lowpass filter the speech signal with a cutoff frequency of 223 00:20:51,886 --> 00:20:57,840 about 3,300 hertz, cutting out everything above that. 224 00:20:57,840 --> 00:21:04,897 Now it turns out that is high enough that it doesn't really affect the formant 225 00:21:04,897 --> 00:21:09,787 structure all that much, at least in terms of understanding 226 00:21:09,787 --> 00:21:16,192 what's being said, but for the fricative signals it definitely makes it very hard. 227 00:21:16,192 --> 00:21:22,265 I know in the old days, it was very hard to tell words apart that differ only 228 00:21:22,265 --> 00:21:26,569 in their fricative. 'Fix' and 'six' were incredibly difficult to 229 00:21:26,569 --> 00:21:32,179 tell apart over old telephone systems. And in fact, I suggest you try it, try it 230 00:21:32,179 --> 00:21:35,639 with a friend. Call them up and say these words in 231 00:21:35,639 --> 00:21:39,223 isolation, and see if your friend can understand 232 00:21:39,223 --> 00:21:42,772 what you're saying. Now, of course, if you put these in a 233 00:21:42,772 --> 00:21:47,155 sentence you can easily tell them apart. You can say 'I fixed my car.'
234 00:21:47,155 --> 00:21:50,836 You certainly wouldn't say 'I sixed my car.' 235 00:21:50,836 --> 00:21:54,826 It doesn't make any sense. So if you say them out of context, in 236 00:21:54,826 --> 00:21:57,188 isolation, one word at a time, 237 00:21:57,188 --> 00:22:01,614 I think you're going to find that in some cases you're going to have a hard time 238 00:22:01,614 --> 00:22:05,082 telling them apart. If your phone system has a much higher 239 00:22:05,082 --> 00:22:10,394 cutoff frequency, then those get through. So you can judge what's going on in the 240 00:22:10,394 --> 00:22:14,902 phone system by playing a little word recognition game. 241 00:22:14,902 --> 00:22:20,076 Now, there's something else I want to point out here, and that is the highest 242 00:22:20,076 --> 00:22:23,787 frequency in this plot is about 5,500 hertz. 243 00:22:23,787 --> 00:22:29,432 And we'll go into why that is when we talk about digital signal processing. 244 00:22:29,432 --> 00:22:35,577 It turns out there are some speech signals that can have higher 245 00:22:35,577 --> 00:22:39,909 frequency content, and we usually say that the bandwidth is 246 00:22:39,909 --> 00:22:43,739 about 6-7 kHz. It's very approximate, because you are 247 00:22:43,739 --> 00:22:49,145 seeing the transfer function for the vowels is going down as frequency goes 248 00:22:49,145 --> 00:22:54,112 up, generally. Beyond about 4-5 kHz it's definitely going down, so the 249 00:22:54,112 --> 00:22:58,435 bandwidth is only a rough measure of what's going on. 250 00:22:58,435 --> 00:23:03,619 So, we now can see a lot of things about the structure of speech. 251 00:23:03,619 --> 00:23:09,606 Now I claim, if you look at the spectrogram, there's something not speech-like 252 00:23:09,606 --> 00:23:15,058 in the spectrogram. There's something here that isn't right. 253 00:23:15,058 --> 00:23:20,943 And in particular,
I want to point out this line, it's going 254 00:23:20,943 --> 00:23:26,224 across here. What in the world does that correspond 255 00:23:26,224 --> 00:23:34,162 to? There are two answers: one is what kind of signal would produce that line, 256 00:23:34,162 --> 00:23:38,742 and the other is what's physically producing that line. 257 00:23:38,742 --> 00:23:47,319 Okay, now, what that looks like, in terms of signals, it turns out 258 00:23:47,319 --> 00:23:55,296 it looks like a sine wave with a frequency of, oh, 1800 hertz. 259 00:23:55,296 --> 00:23:59,180 Well, that's pretty high. It's certainly not power frequency, and it 260 00:23:59,180 --> 00:24:01,773 turns out that's fan noise from my computer. 261 00:24:01,773 --> 00:24:06,141 I couldn't get my computer to turn its fan off when I made this recording, and 262 00:24:06,141 --> 00:24:09,641 you're picking that up. And you can even see there's another 263 00:24:09,641 --> 00:24:12,131 piece of it up there at a higher frequency. 264 00:24:12,131 --> 00:24:15,697 Okay. So, we now know what the structure of 265 00:24:15,697 --> 00:24:19,958 speech is. First of all, it has a bandwidth of about 266 00:24:19,958 --> 00:24:25,033 6 or 7 kilohertz, and that's important in many applications. 267 00:24:25,033 --> 00:24:31,562 The structure of speech is best seen in the frequency domain. There are some time 268 00:24:31,562 --> 00:24:36,442 domain things you can pick out, like what the pitch period is; that could be pretty 269 00:24:36,442 --> 00:24:38,862 easy. But it's hard to see how pitch is 270 00:24:38,862 --> 00:24:43,952 changing with time until you look at the spectrogram, you look in the 271 00:24:43,952 --> 00:24:49,371 frequency domain.
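Picking the pitch period out of the time-domain signal can also be automated. In the lecture the period is read off the plot by eye; the sketch below swaps in a different, common technique, autocorrelation, and runs it on a synthetic periodic signal rather than real speech. The sampling rate, the search range, and the function name `pitch_period_samples` are assumptions made for the example.

```python
import math

def pitch_period_samples(x, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation inside [min_lag, max_lag]."""
    def autocorr(lag):
        return sum(x[n] * x[n + lag] for n in range(len(x) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

fs = 8000  # assumed sampling rate in Hz
# Synthetic periodic signal: a 100 Hz fundamental plus two weaker harmonics.
x = [math.sin(2 * math.pi * 100 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
     + 0.25 * math.sin(2 * math.pi * 300 * n / fs) for n in range(800)]

lag = pitch_period_samples(x, min_lag=40, max_lag=120)  # search roughly 67 Hz to 200 Hz
f0 = fs / lag                                           # pitch frequency is 1/T
```

For this signal the best lag is 80 samples, which is 10 ms at 8 kHz, so the estimated pitch is 100 Hz. Repeating the estimate on successive short frames would trace how the pitch changes over time, which is the information the spectrogram shows at a glance.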
I want to point out that many other 272 00:24:49,371 --> 00:24:59,582 signals share the same frequency range as speech, and signal processing systems 273 00:24:59,582 --> 00:25:05,831 exploit in great detail this special structure of speech, so that they can 274 00:25:05,831 --> 00:25:11,649 send speech signals efficiently. And these are usually called vocoders, 275 00:25:11,649 --> 00:25:19,157 and that is short for voice coder. They typically take speech and, the way I 276 00:25:19,157 --> 00:25:24,077 like to put it, rip apart the speech signal into the 277 00:25:24,077 --> 00:25:30,612 pitch part and the vocal tract part. They actually try to split that signal up 278 00:25:30,612 --> 00:25:34,832 into those 2 pieces. That voice coder is embedded into cell 279 00:25:34,832 --> 00:25:40,792 phones, and in particular, they use a very efficient way of 280 00:25:40,792 --> 00:25:45,934 communicating what the vocal tract is, more so than just sending the raw 281 00:25:45,934 --> 00:25:50,593 speech signal that you would normally do if you just had some regular 282 00:25:50,593 --> 00:25:54,955 six or seven kilohertz signal. Now it turns out vocoders have 283 00:25:54,955 --> 00:25:59,147 slipped into the music arena. And I'm sure if you've heard modern 284 00:25:59,147 --> 00:26:02,852 hip-hop, what they're doing is exactly what a vocoder does. 285 00:26:02,852 --> 00:26:08,107 They're ripping apart the signal into the vocal tract part and the pitch part. 286 00:26:08,107 --> 00:26:11,322 They're playing with the pitch part a little bit, 287 00:26:11,322 --> 00:26:15,752 distorting it from being a pure set of pulses, and then putting it through a 288 00:26:15,752 --> 00:26:18,927 filter that looks like the vocal tract they measured. 289 00:26:18,927 --> 00:26:22,577 And it turns out the music sounds, at least to my ears, weird. 290 00:26:22,577 --> 00:26:27,777 But that's how voice coders work.
And by the way, modern computer 291 00:26:27,777 --> 00:26:33,005 technology and signal processing systems are such that you can do this in real 292 00:26:33,005 --> 00:26:36,024 time. You can actually do this while the singer 293 00:26:36,024 --> 00:26:41,277 is singing. It's kind of interesting. But for our purposes, what they're doing, 294 00:26:41,277 --> 00:26:45,716 and what's important, is they're exploiting this special structure of 295 00:26:45,716 --> 00:26:48,831 speech. And you will learn that the more you know 296 00:26:48,831 --> 00:26:52,407 about signals, the more effectively and efficiently you can 297 00:26:52,407 --> 00:26:54,405 communicate those kinds of signals.