1
00:00:00,012 --> 00:00:04,682
So, in this video, we're going to 
continue our discussion of computing the 

2
00:00:04,682 --> 00:00:09,615
spectra of discrete time signals. 
We'll go into some more practical aspects 

3
00:00:09,615 --> 00:00:14,673
of how you compute these spectra. 
This falls in the regime of what's known 

4
00:00:14,673 --> 00:00:19,177
as spectral analysis. 
It's a technical term which means the 

5
00:00:19,177 --> 00:00:24,048
details of computing spectra 
that are very realistic and reflect the 

6
00:00:24,048 --> 00:00:27,763
signal, not some artifacts of the signal 
processing. 

7
00:00:27,763 --> 00:00:33,426
We'll have to talk about windowing, which 
refers to extracting sections of a longer 

8
00:00:33,426 --> 00:00:38,063
signal for spectral analysis. 
You want to do that very carefully and 

9
00:00:38,063 --> 00:00:41,087
correctly so you don't introduce 
artifacts. 

10
00:00:41,087 --> 00:00:46,612
And this whole thing is put together in 
what's called short-time Fourier analysis: 

11
00:00:46,612 --> 00:00:50,737
discovering how the spectrum of a signal 
changes with time. 

12
00:00:50,737 --> 00:00:56,196
We've encountered this already: 
this is the speech spectrogram, so we're 

13
00:00:56,196 --> 00:01:01,062
going to reveal in this video how the 
speech spectrogram is computed. 

14
00:01:01,062 --> 00:01:04,857
OK. 
So here is the speech spectrogram I've 

15
00:01:04,857 --> 00:01:11,606
shown you in the previous video, and in 
more detail. What we have now is a long 

16
00:01:11,606 --> 00:01:17,845
signal. This is over 1.2 seconds long, 
sampled at a very high rate. 

17
00:01:17,845 --> 00:01:22,712
And you can tell by looking at the 
waveform in time 

18
00:01:22,712 --> 00:01:29,001
that its characteristics are changing 
continually throughout the whole segment. 

19
00:01:29,001 --> 00:01:34,861
So what we want to capture now, in the 
frequency domain, is what's happening: what 

20
00:01:34,861 --> 00:01:40,662
do those spectra look like as we go 
through the signal? And basically the 

21
00:01:40,662 --> 00:01:46,043
idea is that we extract small sections of 
the waveform, and we're going to 

22
00:01:46,043 --> 00:01:51,573
compute their transforms. And it turns out 
that extraction of those pieces proves 

23
00:01:51,573 --> 00:01:56,635
to be very important and you've got to do 
it carefully or else you're going to 

24
00:01:56,635 --> 00:02:00,709
introduce artifacts. 
So before we go into the details of that 

25
00:02:00,709 --> 00:02:08,622
let me ask you a question. As you noted 
before the highest frequency here is 5.5 

26
00:02:08,622 --> 00:02:14,592
kilohertz. 
What is the sampling rate I used to digitize 

27
00:02:14,592 --> 00:02:20,802
the analog speech signal to get it into 
my computer? 

28
00:02:20,802 --> 00:02:25,079
Alright. 
So, you should have gotten that it has to 

29
00:02:25,079 --> 00:02:31,152
be twice the highest frequency. 
So the correct answer is 11 kilohertz. 

30
00:02:31,152 --> 00:02:37,932
Now, 11 kilohertz may seem like kind of 
an odd number, until you look up what the 

31
00:02:37,932 --> 00:02:42,070
sampling rate is for 
a compact disc, for CDs. 

32
00:02:42,070 --> 00:02:45,707
I think you'll quickly figure out why 
11 kilohertz, 

33
00:02:45,707 --> 00:02:49,640
or rather how it's 
related 

34
00:02:49,640 --> 00:02:52,842
to the CD sampling rate. It's kind of 
interesting. 

35
00:02:52,842 --> 00:02:54,853
The CD rate is 44.1 kilohertz, and a quarter 
of that is 11 kilohertz, or about that. 
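
As a quick check on the numbers in this part of the lecture, here is a minimal sketch of the arithmetic. The function name `min_sampling_rate` is made up for illustration, and the 44.1 kHz CD rate is the standard published value, assumed here rather than taken from the video.

```python
def min_sampling_rate(highest_freq_hz):
    """Sampling rate needed to avoid aliasing: twice the highest frequency."""
    return 2.0 * highest_freq_hz

highest_freq_hz = 5500.0                  # 5.5 kHz, the highest frequency in the lecture
fs = min_sampling_rate(highest_freq_hz)   # sampling rate in Hz

cd_rate_hz = 44100.0                      # standard CD rate (assumption, not from the video)
quarter_cd_rate_hz = cd_rate_hz / 4.0     # a quarter of the CD rate, in Hz

print(fs, quarter_cd_rate_hz)
```

A quarter of the CD rate sits just above the minimum rate for a 5.5 kHz signal, which is one plausible reading of why the number looks less odd once you know the CD rate.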

36
00:02:56,650 --> 00:03:01,912
Alright, let's go into the details of 
what we were just talking about. 

37
00:03:01,912 --> 00:03:09,132
So here we have a long signal and we're 
going to chop it up into pieces. 

38
00:03:09,132 --> 00:03:17,195
And these pieces are called sections. 
And the idea is that, for each 

39
00:03:17,195 --> 00:03:23,319
section, I'm going to compute its DFT 
and evaluate the spectrum. 

40
00:03:23,319 --> 00:03:28,504
OK, well, it turns out there's a little 
problem with doing that directly, which we 

41
00:03:28,504 --> 00:03:33,764
need to explore, which means we need to be 
a bit more precise: when we say we take 

42
00:03:33,764 --> 00:03:38,925
out a section, what does that really mean? 
So what that really means is that you 

43
00:03:38,925 --> 00:03:43,177
have a long signal, 
which you have multiplied by what amounts 

44
00:03:43,177 --> 00:03:48,285
to a rectangular pulse, and in the 
spectral analysis world this is known as 

45
00:03:48,285 --> 00:03:53,368
a window because it's through this pulse 
that you're viewing the signal. 

46
00:03:53,368 --> 00:03:59,100
You're not seeing anything else on either 
side, you're viewing the signal through 

47
00:03:59,100 --> 00:04:02,716
the window. 
And of course the word rectangular 

48
00:04:02,716 --> 00:04:07,760
follows from its shape. 
Well, let's look at an example here in 

49
00:04:07,760 --> 00:04:13,320
a bit more detail to see what the effect 
is of multiplying by this window. 

50
00:04:13,320 --> 00:04:19,310
So suppose we have a signal that looks 
something like that and we multiply it by 

51
00:04:19,310 --> 00:04:28,516
a rectangular window, located 
wherever the section occurs, and the result is 

52
00:04:28,516 --> 00:04:35,365
going to be something that looks like 
this. 

53
00:04:35,365 --> 00:04:43,432
And the problem occurs at the edges: 
this jump. 

54
00:04:43,432 --> 00:04:47,062
Not a very big jump here, but a very big 
jump here. 

55
00:04:47,062 --> 00:04:50,182
Well, that was not in the original 
signal. 

56
00:04:50,182 --> 00:04:53,359
The original signal was a smooth blue 
line. 

57
00:04:53,359 --> 00:04:56,536
These jumps are the edge 
effects. 

58
00:04:56,536 --> 00:05:02,112
What they create are these sections in 
the spectrum which don't look right. 

59
00:05:02,112 --> 00:05:07,697
Usually at the high-frequency edges. And 
so, we know this is a speech spectrum, 

60
00:05:07,697 --> 00:05:11,843
and this is clearly not indicative of the 
speech spectrum. 

61
00:05:11,843 --> 00:05:15,964
It's entirely an artifact of using a 
rectangular window. 

62
00:05:15,964 --> 00:05:21,202
It's all due to the edge effects, and so, 
we clearly want to minimize that. 

63
00:05:21,202 --> 00:05:27,306
The way you do that is by selecting a window 
which gracefully goes to zero at the 

64
00:05:27,306 --> 00:05:31,261
edges. 
So, we're going to use what's called 

65
00:05:31,261 --> 00:05:35,843
a Hanning window. 
It turns out it is one cycle of a sinusoid 

66
00:05:35,843 --> 00:05:42,131
that's been raised up to be 
positive and has a maximum amplitude of 

67
00:05:42,131 --> 00:05:45,070
one. 
But it equals zero at the edges. 
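
The window described here can be sketched in a few lines. This assumes the common textbook definition, 0.5 - 0.5*cos(2*pi*n/(N-1)); the video does not show a formula, and the function name `hann_window` is hypothetical.

```python
import math

def hann_window(length):
    """One cycle of a cosine, raised to be non-negative, zero at both ends."""
    if length < 2:
        raise ValueError("length must be at least 2")
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

# An odd length is used so the peak lands exactly on the middle sample.
w = hann_window(257)
print(w[0], max(w), w[-1])
```

The first and last values are the edges of the window, and the maximum sits in the middle of the section.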

68
00:05:45,070 --> 00:05:50,561
So, we can see now that the edge effects 
can't be there, and now we get a 

69
00:05:50,561 --> 00:05:56,832
spectrum, once we take the transform, 
that in the high-frequency region greatly 

70
00:05:56,832 --> 00:06:00,954
resembles the speech spectrum that we 
know is there. 

71
00:06:00,954 --> 00:06:04,562
So, no artifacts. 
We've gotten rid of them. 
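
The effect can be reproduced numerically. The following is a minimal sketch, assuming NumPy and a made-up single-tone test signal rather than the speech section in the video; it compares how much energy each window leaks into the upper half of the band, where the tone has no content.

```python
import numpy as np

section_len = 256                 # samples in the extracted section
nfft = 512                        # transform length (zero-padded)
n = np.arange(section_len)

# Test tone at 0.0537 cycles/sample: not a whole number of cycles in the
# section, so a rectangular cut leaves a jump at the edges.
x = np.cos(2.0 * np.pi * 0.0537 * n)

rect_spectrum = np.abs(np.fft.rfft(x, n=nfft))
hann_spectrum = np.abs(np.fft.rfft(x * np.hanning(section_len), n=nfft))

# Fraction of energy in the upper half of the band (0.25 to 0.5 cycles/sample).
high = slice(nfft // 4, None)
rect_leak = np.sum(rect_spectrum[high] ** 2) / np.sum(rect_spectrum ** 2)
hann_leak = np.sum(hann_spectrum[high] ** 2) / np.sum(hann_spectrum ** 2)

print(f"rectangular: {10 * np.log10(rect_leak):.1f} dB")
print(f"Hanning:     {10 * np.log10(hann_leak):.1f} dB")
```

The Hanning figure should come out far lower than the rectangular one, which is the numerical counterpart of the cleaner high-frequency region in the plot.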

72
00:06:04,562 --> 00:06:10,747
Just by using the Hanning window. Well, it 
turns out there's another little problem with 

73
00:06:10,747 --> 00:06:14,234
the Hanning window which we need to talk 
about. 

74
00:06:14,234 --> 00:06:19,697
Before I get too far along I'm going to 
talk about some other details here. 

75
00:06:19,697 --> 00:06:25,572
Note that I used a length 256 section and 
I'm using a length 512 transform. So I am 

76
00:06:25,572 --> 00:06:30,369
using a longer transform than the length 
of the section, and we understand that 

77
00:06:30,369 --> 00:06:34,990
I'm interested in seeing the spectral 
details, so that makes a lot of sense. 

78
00:06:34,990 --> 00:06:39,527
I could have taken an even longer 
transform if I wanted to, but for this 

79
00:06:39,527 --> 00:06:43,762
example, I only took one twice as long. 
Now, this one is a power of 2. 

80
00:06:43,762 --> 00:06:49,172
There's no reason why the original 
section has to be a power of 2. 

81
00:06:49,172 --> 00:06:53,537
I just used powers of 2 because I'm used to 
doing it. 

82
00:06:53,537 --> 00:06:59,092
I could have used 255 or 308, if I wanted 
to, didn't really matter. 

83
00:06:59,092 --> 00:07:04,777
But I have to pick a power of 2 for the 
transform length, because I'm using the 

84
00:07:04,777 --> 00:07:07,860
FFT. 
And believe me, when you're computing 

85
00:07:07,860 --> 00:07:12,486
spectrograms, you want to use the FFT. 
So, this is where the power of 2 is 

86
00:07:12,486 --> 00:07:15,976
absolutely necessary, but not so for the 
sections. 
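
To make the power-of-2 remark concrete, here is a minimal sketch of a radix-2 FFT with zero-padding. It is an illustrative textbook recursion, not the code used in the video, and `fft_radix2` and `padded_spectrum` are hypothetical names. The section length is free; only the transform length has to be a power of 2 for this simple form.

```python
import cmath
import math

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("transform length must be a power of 2")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])    # transform of even-indexed samples
    odd = fft_radix2(x[1::2])     # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def padded_spectrum(section, nfft):
    """Zero-pad a section of any length up to nfft, then transform it."""
    if len(section) > nfft:
        raise ValueError("section is longer than the transform")
    return fft_radix2(list(section) + [0.0] * (nfft - len(section)))

# A 255-sample section (not a power of 2) with a 512-point transform.
section = [math.cos(2.0 * math.pi * 0.1 * n) for n in range(255)]
spectrum = padded_spectrum(section, 512)
peak_bin = max(range(257), key=lambda k: abs(spectrum[k]))
print(len(spectrum), peak_bin / 512)
```

The printed peak location is in cycles per sample and should sit next to the 0.1 used to build the test tone.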

87
00:07:15,976 --> 00:07:22,048
Well, what's the problem with using the 
Hanning window? Well, if you look at what 

88
00:07:22,048 --> 00:07:26,213
happens here, here are the section 
boundaries again. 

89
00:07:26,213 --> 00:07:31,149
And if you look at what you're 
essentially doing when you apply a 

90
00:07:31,149 --> 00:07:37,589
Hanning window to each section, is that 
you're ignoring large fractions, portions 

91
00:07:37,589 --> 00:07:43,007
of the data that could be important 
because the window goes to 0 at the 

92
00:07:43,007 --> 00:07:48,176
boundaries of these sections. 
What's happening in these 

93
00:07:48,176 --> 00:07:53,603
regions essentially gets set to 0. 
So, you never see them in the spectrum; 

94
00:07:53,603 --> 00:07:58,057
they're going to be gone. 
How do you fix that? The idea is to 

95
00:07:58,057 --> 00:08:02,332
use overlapping windows. 
So, the idea is that we overlap the 

96
00:08:02,332 --> 00:08:06,817
windows, 
one after another, producing a picture 

97
00:08:06,817 --> 00:08:13,092
that looks more like this, and now all of 
the signal gets through and I've 

98
00:08:13,092 --> 00:08:20,042
overlapped here by a half; here's the 
original section length, here's the next 

99
00:08:20,042 --> 00:08:24,092
section length. 
And I've overlapped by a half here of 

100
00:08:24,092 --> 00:08:28,473
this section length. 
You can overlap by more, so that the 

101
00:08:28,473 --> 00:08:31,559
spectra, the windows, come more 
frequently, 

102
00:08:31,559 --> 00:08:37,029
if you want to see more temporal detail, 
more time detail, in how the spectrum's 

103
00:08:37,029 --> 00:08:39,360
changing. 
You may want less. 

104
00:08:39,360 --> 00:08:44,459
You can move it over some. 
You definitely don't want to move it over 

105
00:08:44,459 --> 00:08:49,308
too much, else you'd be ignoring parts of 
the original signal. 
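
One way to see why overlapping by a half lets all of the signal get through is to add up the window weights that land on each sample. This is a minimal sketch, assuming the periodic form of the Hanning window, 0.5 - 0.5*cos(2*pi*n/N), chosen here so the sum comes out exactly constant; `window_coverage` is a hypothetical helper.

```python
import math

def periodic_hann(length):
    """Hanning window in its periodic form: 0.5 - 0.5*cos(2*pi*n/length)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]

def window_coverage(signal_len, win_len, hop):
    """Sum of all window weights landing on each sample of the signal."""
    window = periodic_hann(win_len)
    coverage = [0.0] * signal_len
    for start in range(0, signal_len - win_len + 1, hop):
        for n, weight in enumerate(window):
            coverage[start + n] += weight
    return coverage

win_len = 256
no_overlap = window_coverage(1024, win_len, hop=win_len)         # sections side by side
half_overlap = window_coverage(1024, win_len, hop=win_len // 2)  # overlap by a half

# Look away from the very start and end, where only one window reaches.
interior = slice(win_len // 2, 1024 - win_len // 2)
print(min(no_overlap[interior]), min(half_overlap[interior]))
```

Without overlap the total weight drops to zero at every section boundary; with half overlap every interior sample receives the same total weight.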

106
00:08:49,308 --> 00:08:54,468
So now we've got all the data coming 
through and now we can compute the 

107
00:08:54,468 --> 00:08:58,397
spectrogram. 
So here's the big picture: you take a 

108
00:08:58,397 --> 00:09:04,107
long signal, you use Hanning windows or 
something like them to go smoothly to zero at the 

109
00:09:04,107 --> 00:09:10,183
edges, you overlap the sections so that you 
don't miss anything in the data, and now 

110
00:09:10,183 --> 00:09:13,787
you can take a Fourier transform of each 
section. 
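
Putting the pieces of the big picture together, a minimal sketch might look like the following. It assumes NumPy, an 11 kHz sampling rate, and a made-up two-tone test signal rather than speech; `spectrogram` and its parameter names are hypothetical, and this is not the code that produced the figure in the video.

```python
import numpy as np

def spectrogram(x, win_len=256, overlap=0.5, nfft=512):
    """Magnitude spectra of overlapping Hanning-windowed sections.

    Returns an array of shape (nfft // 2 + 1, n_sections); each column is
    the spectrum of one section.
    """
    hop = int(win_len * (1.0 - overlap))
    if hop < 1:
        raise ValueError("overlap leaves no hop between sections")
    window = np.hanning(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    columns = [np.abs(np.fft.rfft(x[s:s + win_len] * window, n=nfft))
               for s in starts]
    return np.array(columns).T

fs = 11000.0                                   # sampling rate in Hz (assumed)
t = np.arange(4096) / fs
# Test signal whose spectrum changes with time: 1 kHz, then 3 kHz.
x = np.concatenate([np.sin(2 * np.pi * 1000.0 * t),
                    np.sin(2 * np.pi * 3000.0 * t)])

S = spectrogram(x)
freqs = np.fft.rfftfreq(512, d=1.0 / fs)       # frequency of each row in Hz
print(S.shape, freqs[S[:, 0].argmax()], freqs[S[:, -1].argmax()])
```

The peak frequency in the first column should sit near 1 kHz and in the last column near 3 kHz, a small-scale version of tracking a peak through time.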

111
00:09:13,787 --> 00:09:19,042
And here's why you use the FFT. 
Because of the overlap by half, I am 

112
00:09:19,042 --> 00:09:24,167
actually computing twice as many Fourier 
transforms as I did in the original 

113
00:09:24,167 --> 00:09:29,692
setup, and so I'm doing lots and lots of 
transforms, but I'm getting very accurate 

114
00:09:29,692 --> 00:09:33,092
answers. 
If it wasn't for the speed and efficiency 

115
00:09:33,092 --> 00:09:37,717
of the FFT, I couldn't do this. 
It would take way too long for me to 

116
00:09:37,717 --> 00:09:43,379
be patient enough to wait for the answer. 
Once I get these transforms I now have 

117
00:09:43,379 --> 00:09:47,207
spectra and I can display them in all 
kinds of ways. 

118
00:09:47,207 --> 00:09:53,201
We're going to display them as an image, 
you could display them other ways, but I 

119
00:09:53,201 --> 00:09:59,086
do want to point out that now you can do 
things like track this peak through here 

120
00:09:59,086 --> 00:10:05,203
and see how it changes in time. 
Where its location in frequency is 

121
00:10:05,203 --> 00:10:10,726
changes through time. 
We get a very good idea of what the 

122
00:10:10,726 --> 00:10:15,835
structure of the signal is in the 
frequency domain. 

123
00:10:15,835 --> 00:10:23,101
So here's our spectrogram. 
What the display really is, is 

124
00:10:23,101 --> 00:10:28,668
that every column of this image is a 
spectrum, computed using the FFT. 

125
00:10:28,668 --> 00:10:34,222
We then display the value of that 
spectrum as a color in a heat map. 

126
00:10:34,222 --> 00:10:40,304
And you can see, by the fact that you can't 
see the quantization in the image, that I'm 

127
00:10:40,304 --> 00:10:46,513
computing lots and lots of transforms, and 
that's just the way it is. 

128
00:10:46,513 --> 00:10:53,956
And it turns out, because of the FFT, I 
can compute a speech spectrogram in real 

129
00:10:53,956 --> 00:10:57,742
time. 
What that means is I can compute the 

130
00:10:57,742 --> 00:11:03,671
spectra just as fast as the data are 
being sampled by the computer. 

131
00:11:04,838 --> 00:11:08,325
That's the efficiency and the value of 
using the FFT. 

132
00:11:08,325 --> 00:11:13,226
It's really very important. 
On a more technical note, the thing you 

133
00:11:13,226 --> 00:11:18,275
have to do when you're using the 
spectrogram is that you have to determine 

134
00:11:18,275 --> 00:11:21,908
three things. 
You have to determine the window length, 

135
00:11:21,908 --> 00:11:25,312
how much the windows overlap, 
and the transform length. 

136
00:11:25,312 --> 00:11:29,780
In most cases, the transform length is 
longer than the window length. 

137
00:11:29,780 --> 00:11:34,749
It depends how much detail you want in 
the spectrum that you're trying to 

138
00:11:34,749 --> 00:11:37,814
examine. 
The window length is determined by how 

139
00:11:37,814 --> 00:11:41,046
rapidly things are changing in time in 
the signal. 

140
00:11:41,046 --> 00:11:45,895
So that's where the temporal structure 
of the signal becomes important. 

141
00:11:45,895 --> 00:11:49,749
As for the overlap, a half is a normal 
default kind of overlap. 

142
00:11:49,749 --> 00:11:54,748
You may want more overlap to get more 
detail of how the spectrum is changing. 

143
00:11:54,748 --> 00:11:59,699
If you use much less than a half you may 
not be happy with the results because 

144
00:11:59,699 --> 00:12:03,182
then you'd tend to be missing parts of 
the signal. 
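
The three choices can be turned into numbers before computing anything. This is a minimal sketch, assuming the 11 kHz sampling rate and the 256 and 512 lengths from earlier in the video; `stft_parameters` is a hypothetical helper, not part of any library.

```python
def stft_parameters(fs, win_len, overlap, nfft):
    """Translate window length, overlap, and transform length into units."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be a fraction in [0, 1)")
    if nfft < win_len:
        raise ValueError("transform length must be at least the window length")
    hop = int(round(win_len * (1.0 - overlap)))
    if hop < 1:
        raise ValueError("overlap leaves no hop between windows")
    return {
        "window_ms": 1000.0 * win_len / fs,   # span of signal in each spectrum
        "hop_ms": 1000.0 * hop / fs,          # time between successive spectra
        "bin_spacing_hz": fs / nfft,          # spacing of computed frequency samples
        "spectra_per_second": fs / hop,       # how many transforms each second needs
    }

params = stft_parameters(fs=11000.0, win_len=256, overlap=0.5, nfft=512)
for name, value in params.items():
    print(f"{name}: {value:.2f}")
```

A shorter window or more overlap gives spectra more often in time, while a longer transform gives more closely spaced frequency samples, at the cost of more computation per second.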

145
00:12:03,182 --> 00:12:07,897
With these kinds of details and a lot of 
experience, you too can compute 

146
00:12:07,897 --> 00:12:12,487
a speech spectrogram that 
accurately reflects what's going on in 

147
00:12:12,487 --> 00:12:13,110
the signal.