In this video, we're going to look at the issue of training deep autoencoders. People thought of these a long time ago, in the mid 1980s, but they simply couldn't train them well enough for them to do significantly better than principal components analysis. There were various papers published about them, but no good demonstrations of impressive performance. After we developed methods of pre-training deep networks one layer at a time, Russ Salakhutdinov and I applied these methods to pre-training deep autoencoders, and for the first time we got much better representations out of deep autoencoders than we could get from principal components analysis.

Deep autoencoders always seemed like a really nice way to do dimensionality reduction, because it seemed like they should work much better than principal components analysis. They provide flexible mappings in both directions, and the mappings can be non-linear. Their learning time should be linear, or better, in the number of training cases. And after they've been learned, the encoding part of the network is fairly fast, because it's just a matrix multiply for each layer.

Unfortunately, it was very difficult to optimize deep autoencoders using backpropagation. Typically people tried small initial weights, and then the back-propagated gradient died, so for deep networks they never got off the ground. But now we have much better ways to optimize them: we can use unsupervised layer-by-layer pre-training, or we can simply initialize the weights sensibly, as in echo state networks.

The first really successful deep autoencoders were learned by Russ Salakhutdinov and me in 2006. We applied them to the MNIST digits. So we started with images with 784 pixels, and we then encoded those, via three hidden layers, into 30 real-valued activities in a central code layer. We then decoded those 30 real-valued activities back to 784 reconstructed pixels. We used a stack of restricted Boltzmann machines to initialize the weights used for encoding, and we then took the transposes of those weights and initialized the decoding network with them. So initially, the 784 pixels were reconstructed using a weight matrix that was just the transpose of the weight matrix used for encoding them.
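As a rough illustration of that unrolling step, here is a minimal numpy sketch in which the decoder is initialized with the transposes of the encoder weights. The intermediate layer sizes (1000, 500, 250) and the randomly drawn stand-in weights are assumptions for illustration only; in the actual setup each weight matrix would come from greedy layer-by-layer RBM pretraining.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The lecture specifies 784 pixels, three hidden layers, and a 30-unit
# real-valued code; the intermediate sizes below are illustrative assumptions.
layer_sizes = [784, 1000, 500, 250, 30]

# Stand-in for weights learned by a stack of restricted Boltzmann machines.
rng = np.random.default_rng(0)
encoder_weights = [rng.normal(0.0, 0.01, (n_in, n_out))
                   for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
encoder_biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# "Unrolling": the decoder starts out as the transposes of the encoder weights.
decoder_weights = [W.T.copy() for W in reversed(encoder_weights)]
decoder_biases = [np.zeros(n_in) for n_in in reversed(layer_sizes[:-1])]

def encode(pixels):
    h = pixels
    for i, (W, b) in enumerate(zip(encoder_weights, encoder_biases)):
        pre = h @ W + b
        # The central code layer uses linear (real-valued) units;
        # the earlier hidden layers are logistic.
        h = pre if i == len(encoder_weights) - 1 else sigmoid(pre)
    return h

def decode(code):
    h = code
    for W, b in zip(decoder_weights, decoder_biases):
        h = sigmoid(h @ W + b)  # logistic units, including the 784 output pixels
    return h

reconstruction = decode(encode(rng.random((1, 784))))
print(reconstruction.shape)  # (1, 784)
```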
But after the four restricted Boltzmann machines had been trained and unrolled to give the transposes for decoding, we then applied backpropagation to minimize the reconstruction error of the 784 pixels. In this case we were using a cross-entropy error, because the pixels were represented by logistic units. So that error was back-propagated through this whole deep net. And once we started back-propagating the error, the weights used for reconstructing the pixels became different from the weights used for encoding the pixels, although they typically stayed fairly similar.

This worked very well. So if you look at the first row, that's one random sample from each digit class. If you look at the second row, that's the reconstruction of the random sample by the deep autoencoder that uses 30 linear hidden units in its central code layer. So the data has been compressed to 30 real numbers and then reconstructed. If you look at the eight, you can see that the reconstruction is actually better than the eight: it's got rid of the little defect in the eight, because it doesn't have the capacity to encode it. If you compare that with linear principal components analysis, you can see it's much better. A linear mapping to 30 real numbers cannot do nearly as good a job of representing the data.
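As a self-contained sketch of the fine-tuning objective: with logistic output units, the cross-entropy reconstruction error has the convenient property that its gradient with respect to the output pre-activations is just (reconstruction minus pixels), which is the error signal that backpropagation then pushes through the whole unrolled network. The function names and the random stand-in data below are illustrative assumptions, not the original implementation.

```python
import numpy as np

def cross_entropy_reconstruction_error(pixels, reconstruction, eps=1e-12):
    """Summed cross-entropy between target pixel intensities (treated as
    probabilities in [0, 1]) and the logistic reconstruction."""
    r = np.clip(reconstruction, eps, 1 - eps)
    return -np.sum(pixels * np.log(r) + (1 - pixels) * np.log(1 - r))

def output_delta(pixels, reconstruction):
    # For logistic outputs with cross-entropy error, the gradient of the error
    # with respect to the output pre-activations is simply the difference;
    # this is where backpropagation through the deep net begins.
    return reconstruction - pixels

# Illustrative usage with random stand-in data (hypothetical, not MNIST).
rng = np.random.default_rng(1)
pixels = rng.random((1, 784))
reconstruction = rng.random((1, 784))
print(cross_entropy_reconstruction_error(pixels, reconstruction))
print(output_delta(pixels, reconstruction).shape)  # (1, 784)
```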