1 00:00:00,012 --> 00:00:04,682 So, in this video, we're going to continue our discussion of computing the 2 00:00:04,682 --> 00:00:09,615 spectra of discrete-time signals. We'll go into some more practical aspects 3 00:00:09,615 --> 00:00:14,673 of how you compute these spectra. This falls in the regime of what's known 4 00:00:14,673 --> 00:00:19,177 as spectral analysis. It's a technical term which means the 5 00:00:19,177 --> 00:00:24,048 details of computing spectra that are very realistic and reflect the 6 00:00:24,048 --> 00:00:27,763 signal, not some artifacts of the signal processing. 7 00:00:27,763 --> 00:00:33,426 We'll have to talk about windowing, which refers to extracting sections of a longer 8 00:00:33,426 --> 00:00:38,063 signal for spectral analysis. You want to do that very carefully and 9 00:00:38,063 --> 00:00:41,087 correctly so you don't introduce artifacts. 10 00:00:41,087 --> 00:00:46,612 And this whole thing is put together in what's called short-time Fourier analysis: 11 00:00:46,612 --> 00:00:50,737 discovering how the spectrum of a signal changes with time. 12 00:00:50,737 --> 00:00:56,196 We've already encountered this; this is the speech spectrogram, so we're 13 00:00:56,196 --> 00:01:01,062 going to reveal in this video how the speech spectrogram is computed. 14 00:01:01,062 --> 00:01:04,857 OK. So here is the speech spectrogram I 15 00:01:04,857 --> 00:01:11,606 showed you in the previous video, and in more detail. What we have now is a long 16 00:01:11,606 --> 00:01:17,845 signal; this is over 1.2 seconds long, sampled at a very high rate. 17 00:01:17,845 --> 00:01:22,712 And you can tell by looking at the waveform in time 18 00:01:22,712 --> 00:01:29,001 that its characteristics are changing continually throughout the whole segment. 19 00:01:29,001 --> 00:01:34,861 So what we want to capture now, in the frequency domain, is what's happening: what 20 00:01:34,861 --> 00:01:40,662 do those spectra look like as we go through the signal? 
And basically the 21 00:01:40,662 --> 00:01:46,043 idea is that we extract small sections of the waveform and we're going to 22 00:01:46,043 --> 00:01:51,573 compute their transforms, and it turns out that extraction of those pieces is 23 00:01:51,573 --> 00:01:56,635 very important, and you've got to do it carefully or else you're going to 24 00:01:56,635 --> 00:02:00,709 introduce artifacts. So before we go into the details of that, 25 00:02:00,709 --> 00:02:08,622 let me ask you a question. As you noted before, the highest frequency here is 5.5 26 00:02:08,622 --> 00:02:14,592 kilohertz. What is the sampling rate I used to digitize 27 00:02:14,592 --> 00:02:20,802 the analog speech signal to bring it into my computer? 28 00:02:20,802 --> 00:02:25,079 Alright. So, you should have gotten that it has to 29 00:02:25,079 --> 00:02:31,152 be twice the highest frequency. So the correct answer is 11 kilohertz. 30 00:02:31,152 --> 00:02:37,932 Now, 11 kilohertz may seem like kind of an odd number, until you look up what the 31 00:02:37,932 --> 00:02:42,070 sampling rate is for a compact disc, for CDs. 32 00:02:42,070 --> 00:02:45,707 I think you'll quickly figure out why. 33 00:02:45,707 --> 00:02:49,640 The reason for why it's, how it's related, rather, 34 00:02:49,640 --> 00:02:52,842 to the CD sampling rate is kind of interesting. 35 00:02:52,842 --> 00:02:54,853 Most computers sample at about 11 kHz. 36 00:02:56,650 --> 00:03:01,912 Alright, let's go into the details of what we were just talking about. 37 00:03:01,912 --> 00:03:09,132 So here we have a long signal and we're going to chop it up into pieces. 38 00:03:09,132 --> 00:03:17,195 And these pieces are called sections. And the idea is that, for each 39 00:03:17,195 --> 00:03:23,319 section, I'm going to compute its DFT and evaluate the spectrum. 
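The sectioning step just described can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `section_spectra`, the one-second duration, and the 1 kHz test tone standing in for speech are all hypothetical; only the 11 kHz sampling rate and the idea of one DFT per section come from the lecture.

```python
import numpy as np

def section_spectra(signal, section_length):
    """Chop a signal into back-to-back sections and take the DFT of each one."""
    n_sections = len(signal) // section_length   # drop any partial section at the end
    sections = signal[:n_sections * section_length].reshape(n_sections, section_length)
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(sections, axis=1))

fs = 11_000                           # sampling rate in Hz (the lecture's answer)
t = np.arange(fs) / fs                # one second of samples (assumed duration)
x = np.sin(2 * np.pi * 1000 * t)      # hypothetical 1 kHz tone standing in for speech
spectra = section_spectra(x, 256)     # one row of magnitudes per section
print(spectra.shape)
```

Each row of `spectra` is the magnitude spectrum of one section. No window other than the implicit rectangular one is applied here, which is exactly the situation the lecture examines next.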
40 00:03:23,319 --> 00:03:28,504 OK, well, it turns out there's a little problem with doing that directly, which we 41 00:03:28,504 --> 00:03:33,764 need to explore, which means we need to be a bit more precise. So when we take 42 00:03:33,764 --> 00:03:38,925 out a section, what does that really mean? So what that really means is that you 43 00:03:38,925 --> 00:03:43,177 have a long signal which you have multiplied by what amounts 44 00:03:43,177 --> 00:03:48,285 to a rectangular pulse, and in the spectral analysis world this is known as 45 00:03:48,285 --> 00:03:53,368 a window, because it's through this pulse that you're viewing the signal. 46 00:03:53,368 --> 00:03:59,100 You're not seeing anything else on either side; you're viewing the signal through 47 00:03:59,100 --> 00:04:02,716 the window. And of course the word rectangular 48 00:04:02,716 --> 00:04:07,760 follows from its shape. Well, let's look at an example here in 49 00:04:07,760 --> 00:04:13,320 a bit more detail to see what the effect is of multiplying by this window. 50 00:04:13,320 --> 00:04:19,310 So suppose we have a signal that looks something like that and we multiply it by 51 00:04:19,310 --> 00:04:28,516 a rectangular window, wherever it occurs, and the result is 52 00:04:28,516 --> 00:04:35,365 going to be something that looks like this. 53 00:04:35,365 --> 00:04:43,432 And the problem occurs at the edges: this jump. 54 00:04:43,432 --> 00:04:47,062 Not a very big jump here, but a very big jump here. 55 00:04:47,062 --> 00:04:50,182 Well, that was not in the original signal. 56 00:04:50,182 --> 00:04:53,359 The original signal was a smooth blue line. 57 00:04:53,359 --> 00:04:56,536 What these jumps create are the edge effects. 58 00:04:56,536 --> 00:05:02,112 What they create are these sections in the spectrum which don't look right. 
59 00:05:02,112 --> 00:05:07,697 Usually at the high-frequency edges. And so, we know this is a speech spectrum, 60 00:05:07,697 --> 00:05:11,843 and this is clearly not indicative of the speech spectrum. 61 00:05:11,843 --> 00:05:15,964 It's entirely an artifact of using a rectangular window. 62 00:05:15,964 --> 00:05:21,202 It's all due to the edge effects, and so we clearly want to minimize that. 63 00:05:21,202 --> 00:05:27,306 How you do that is by selecting a window which gracefully goes to zero at the 64 00:05:27,306 --> 00:05:31,261 edges. So, we're going to use what's called 65 00:05:31,261 --> 00:05:35,843 a Hanning window. It turns out it is one cycle of a sinusoid 66 00:05:35,843 --> 00:05:42,131 that's been raised up to be positive and has a maximum amplitude of 67 00:05:42,131 --> 00:05:45,070 one. But it equals zero at the edges. 68 00:05:45,070 --> 00:05:50,561 So, we can see now that the edge effects can't be there, and now we get a 69 00:05:50,561 --> 00:05:56,832 spectrum, once we take the transform, that in the high-frequency region greatly 70 00:05:56,832 --> 00:06:00,954 resembles the speech spectrum that we know is there. 71 00:06:00,954 --> 00:06:04,562 So, no artifacts. We've gotten rid of them 72 00:06:04,562 --> 00:06:10,747 just by using the Hanning window. Well, it turns out there's another little problem with 73 00:06:10,747 --> 00:06:14,234 the Hanning window, which we need to talk about. 74 00:06:14,234 --> 00:06:19,697 Before I get too far along, I'm going to talk about some other details here. 75 00:06:19,697 --> 00:06:25,572 Note that I used a length-256 section and I'm using a length-512 transform. So I am 76 00:06:25,572 --> 00:06:30,369 using a longer transform than the length of the section, and we understand that 77 00:06:30,369 --> 00:06:34,990 I'm interested in seeing the spectral details, so that makes a lot of sense. 
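To make the edge-effect argument concrete, here is a rough sketch comparing a rectangular window with a Hanning window on a single length-256 section, using a length-512 transform as in the lecture. The test signal (a cosine that does not complete a whole number of cycles, so the section ends on a jump) and the helper name `spectrum_db` are assumptions chosen for illustration; this is not the speech data from the video.

```python
import numpy as np

section_length = 256      # section length used in the lecture
transform_length = 512    # transform length used in the lecture

n = np.arange(section_length)
# Hypothetical test signal: 10.5 cycles of a cosine, so the section starts
# near +1 and ends near -1, which leaves a large jump at the edges.
x = np.cos(2 * np.pi * 10.5 * n / section_length)

rectangular = np.ones(section_length)
# np.hanning gives one cycle of a raised cosine: zero at both edges,
# close to one in the middle.
hanning = np.hanning(section_length)

def spectrum_db(section, window):
    # Passing n zero-pads the windowed section out to transform_length.
    magnitude = np.abs(np.fft.rfft(section * window, n=transform_length))
    return 20 * np.log10(magnitude / magnitude.max() + 1e-12)

rect_db = spectrum_db(x, rectangular)
hann_db = spectrum_db(x, hanning)

# Largest level, relative to the peak, in the upper half of the frequency range.
print(rect_db[128:].max(), hann_db[128:].max())
```

The signal has no energy at high frequencies, so anything that shows up in the upper half of the spectrum comes from the window. The rectangular window leaves noticeably more there than the Hanning window does.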
78 00:06:34,990 --> 00:06:39,527 I could have taken an even longer transform if I wanted to, but for this 79 00:06:39,527 --> 00:06:43,762 example, I only took one twice as long. Now, this one is a power of 2. 80 00:06:43,762 --> 00:06:49,172 There's no reason why the original section has to be a power of 2. 81 00:06:49,172 --> 00:06:53,537 I just used powers of 2 because I'm used to doing it. 82 00:06:53,537 --> 00:06:59,092 I could have used 255 or 308 if I wanted to; it didn't really matter. 83 00:06:59,092 --> 00:07:04,777 But I have to pick a power of 2 for the transform length, because I'm using the 84 00:07:04,777 --> 00:07:07,860 FFT. And believe me, when you're computing 85 00:07:07,860 --> 00:07:12,486 spectrograms, you want to use the FFT. So, this is where the power of 2 is 86 00:07:12,486 --> 00:07:15,976 absolutely necessary, but not so for the sections. 87 00:07:15,976 --> 00:07:22,048 Well, what's the problem with using the Hanning window? Well, if you look at what 88 00:07:22,048 --> 00:07:26,213 happens here. Here are the section boundaries again. 89 00:07:26,213 --> 00:07:31,149 And if you look at what you're essentially doing when you apply a 90 00:07:31,149 --> 00:07:37,589 Hanning window to each section, it's that you're ignoring large fractions, portions 91 00:07:37,589 --> 00:07:43,007 of the data that could be important, because the window goes to 0 at the 92 00:07:43,007 --> 00:07:48,176 boundaries of these sections. What's happening in these 93 00:07:48,176 --> 00:07:53,603 regions essentially gets set to 0. So, you never see them in the spectrum; 94 00:07:53,603 --> 00:07:58,057 they're going to be gone. How do you fix that? The idea is to 95 00:07:58,057 --> 00:08:02,332 use overlapping windows. So, the idea is that we overlap the 96 00:08:02,332 --> 00:08:06,817 windows. 
One after another, producing a picture 97 00:08:06,817 --> 00:08:13,092 that looks more like this, and now all of the signal gets through. I've 98 00:08:13,092 --> 00:08:20,042 overlapped here by a half; here's the original section length, here's the next 99 00:08:20,042 --> 00:08:24,092 section length. And I've overlapped by a half here of 100 00:08:24,092 --> 00:08:28,473 this section length. You can overlap by more, so that the 101 00:08:28,473 --> 00:08:31,559 spectra, the windows, come more frequently, 102 00:08:31,559 --> 00:08:37,029 if you want to see more temporal detail, more time detail, in how the spectrum's 103 00:08:37,029 --> 00:08:39,360 changing. You may want less. 104 00:08:39,360 --> 00:08:44,459 You can move it over some. You definitely don't want to move it over 105 00:08:44,459 --> 00:08:49,308 too much, else you'd be ignoring parts of the original signal. 106 00:08:49,308 --> 00:08:54,468 So now we've got all the data coming through and now we can compute the 107 00:08:54,468 --> 00:08:58,397 spectrogram. So here's the big picture: you take a 108 00:08:58,397 --> 00:09:04,107 long signal. You use Hanning windows or something like them to go smoothly to zero at the 109 00:09:04,107 --> 00:09:10,183 edges. You overlap the sections so that you don't miss anything in the data, and now 110 00:09:10,183 --> 00:09:13,787 you can take a Fourier transform of each section. 111 00:09:13,787 --> 00:09:19,042 And here's why you use the FFT. Because of the overlap by half, I am 112 00:09:19,042 --> 00:09:24,167 actually computing twice as many Fourier transforms as I did in the original 113 00:09:24,167 --> 00:09:29,692 setup, and so I'm doing lots and lots of transforms, but I'm getting very accurate 114 00:09:29,692 --> 00:09:33,092 answers. If it wasn't for the speed and efficiency 115 00:09:33,092 --> 00:09:37,717 of the FFT, I couldn't do this. It would take way too long for me to 116 00:09:37,717 --> 00:09:43,379 be patient enough to wait for the answer. 
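The claim that half-overlapped windows let all of the signal through, at the cost of roughly twice as many transforms, can be checked numerically. The sketch below makes two assumptions that are not in the lecture: a signal length of 4096 samples, and the periodic form of the Hanning window (cosine period equal to the section length), for which the shifted copies add up to exactly one.

```python
import numpy as np

section_length = 256
hop = section_length // 2      # overlap by half: each window starts half a section later
signal_length = 4096           # hypothetical signal length, for illustration only

# Periodic form of the Hanning window (an assumption, see the note above).
n = np.arange(section_length)
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / section_length)

# Add up every shifted window to see how strongly each sample is weighted overall.
starts = np.arange(0, signal_length - section_length + 1, hop)
coverage = np.zeros(signal_length)
for start in starts:
    coverage[start:start + section_length] += window

n_no_overlap = signal_length // section_length   # transforms without overlap
n_overlap = len(starts)                          # transforms with half overlap
print(n_no_overlap, n_overlap, coverage[hop:-hop].min(), coverage[hop:-hop].max())
```

Away from the first and last half-section, every sample receives a total weight of one, so nothing is set to zero any more. The number of transforms goes from 16 to 31, which is the "twice as many" the lecture refers to.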
Once I get these transforms, I now have 117 00:09:43,379 --> 00:09:47,207 spectra and I can display them in all kinds of ways. 118 00:09:47,207 --> 00:09:53,201 We're going to display them as an image. You could display them other ways, but I 119 00:09:53,201 --> 00:09:59,086 do want to point out that now you can do things like track this peak through here 120 00:09:59,086 --> 00:10:05,203 and see how it changes in time, how its location in frequency 121 00:10:05,203 --> 00:10:10,726 changes through time. We get a very good idea of what the 122 00:10:10,726 --> 00:10:15,835 structure of the signal is in the frequency domain. 123 00:10:15,835 --> 00:10:23,101 So here's our spectrogram, and really what the display is, is 124 00:10:23,101 --> 00:10:28,668 that every column of this image is a spectrum, computed using the FFT. 125 00:10:28,668 --> 00:10:34,222 We then display the value of that spectrum as a color in a heat map. 126 00:10:34,222 --> 00:10:40,304 And you can see, by the fact you can't see the quantization in the image, that I'm 127 00:10:40,304 --> 00:10:46,513 computing lots and lots of transforms, and that's just the way it is. 128 00:10:46,513 --> 00:10:53,956 And it turns out, because of the FFT, I can compute the speech spectrogram in real 129 00:10:53,956 --> 00:10:57,742 time. What that means is I can compute the 130 00:10:57,742 --> 00:11:03,671 spectra just as fast as the data are being sampled by the computer. 131 00:11:04,838 --> 00:11:08,325 That's the efficiency and the value of using the FFT. 132 00:11:08,325 --> 00:11:13,226 It's really, really very important. On a more technical note, the thing you 133 00:11:13,226 --> 00:11:18,275 have to do when you're using the spectrogram is you have to determine 134 00:11:18,275 --> 00:11:21,908 three things. You have to determine the window length, 135 00:11:21,908 --> 00:11:25,312 how much they overlap, and the transform length. 
136 00:11:25,312 --> 00:11:29,780 In most cases, the transform length is longer than the window length. 137 00:11:29,780 --> 00:11:34,749 It depends how much detail you want in the spectrum that you're trying to 138 00:11:34,749 --> 00:11:37,814 examine. The window length is determined by how 139 00:11:37,814 --> 00:11:41,046 rapidly things are changing in time in the signal. 140 00:11:41,046 --> 00:11:45,895 So that's where the temporal structure of the signal becomes important. 141 00:11:45,895 --> 00:11:49,749 For the overlap, a half is a normal default kind of overlap. 142 00:11:49,749 --> 00:11:54,748 You may want more overlap to get more detail of how the spectrum is changing. 143 00:11:54,748 --> 00:11:59,699 If you use much less than a half, you may not be happy with the results, because 144 00:11:59,699 --> 00:12:03,182 then you'd tend to be missing parts of the signal. 145 00:12:03,182 --> 00:12:07,897 With these kinds of details and a lot of experience, you too can compute 146 00:12:07,897 --> 00:12:12,487 a speech spectrogram that accurately reflects what's going on in 147 00:12:12,487 --> 00:12:13,110 the signal.
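Putting the three choices together (window length, overlap, transform length), a spectrogram computation might look like the sketch below. It is not the code used to produce the spectrogram in the video: the function name `spectrogram`, its default values, and the chirp test signal are all assumptions made so the example can run on its own.

```python
import numpy as np

def spectrogram(signal, window_length=256, overlap=0.5, transform_length=512):
    """Magnitude spectrogram: rows are frequencies, columns are times."""
    hop = max(1, int(window_length * (1 - overlap)))   # samples between window starts
    window = np.hanning(window_length)                 # goes smoothly to zero at the edges
    starts = range(0, len(signal) - window_length + 1, hop)
    columns = [
        # n=transform_length zero-pads each windowed section before the FFT
        np.abs(np.fft.rfft(signal[s:s + window_length] * window, n=transform_length))
        for s in starts
    ]
    return np.array(columns).T

fs = 11_000                  # sampling rate in Hz, as in the lecture
n_samples = 13_200           # 1.2 seconds at 11 kHz
t = np.arange(n_samples) / fs
# Hypothetical test signal: a chirp whose frequency rises from 500 Hz to about 2.9 kHz.
x = np.sin(2 * np.pi * (500 * t + 1000 * t ** 2))

S = spectrogram(x)
peak_hz = S.argmax(axis=0) * fs / 512    # frequency of the largest peak in each column
print(S.shape, peak_hz[0], peak_hz[-1])
```

Tracking `peak_hz` across columns is a small version of following a peak through the spectrogram: for this test signal the peak frequency climbs steadily with time. Displaying `S` (usually in decibels) as an image gives the heat-map picture described above.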