Another useful way to introduce hints is through the way you train the model. Say you have an image classification problem, but this time it's nearly impossible: you only have 5,000 labeled images, and you want to take a photo of a person and predict the style of their clothing, say smart, formal, whatever. In this case it's really hard to obtain labeled images, because you would have to label them all by hand or hire people to do it for you.

The problem is that if you fit a neural network here, you have two choices. Either you use a normal-sized network, but since there is so little data, it will probably overfit before it ever fits anything. Or you use a heavily regularized network, something like one layer with a couple of neurons, but a network that small won't be able to learn anything better than a linear model.

Fortunately, deep learning allows you to kill both birds with one stone. You can build a large network that nevertheless does not overfit too much. We do so by introducing other problems that it should also solve. Let's say this clothing style problem is too hard to obtain data for, but we have a different problem: for the same kind of data, we have age and gender labels. Those are much easier to get, since you can probably extract a lot of such information from social networks; you can just parse everything you've got. It might be slightly illegal, but let's assume you work at the social network company and you own this data.

In this case, you want to train your network to predict both the style and the age and gender features, and you do it the following way. You feed the portrait photo through the first layer, then the second one, and then there is a split, after which each head of your network is used to predict a different set of labels.
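To make the split concrete, here is a minimal PyTorch sketch of such a two-headed network. Everything in it is illustrative: the class name TwoHeadNet, the layer sizes, and the use of dense layers on already-flattened inputs instead of the convolutional layers a real vision model would use.

```python
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with task-specific heads; all sizes are illustrative."""
    def __init__(self, n_features=1024, n_styles=5, n_age_groups=8):
        super().__init__()
        # Shared layers: both tasks shape the features learned here.
        self.trunk = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Head for the small, hard-to-label task: clothing style.
        self.style_head = nn.Linear(128, n_styles)
        # Heads for the large, easy-to-label tasks: age group and gender.
        self.age_head = nn.Linear(128, n_age_groups)
        self.gender_head = nn.Linear(128, 2)

    def forward(self, x):
        h = self.trunk(x)  # shared representation used by every head
        return self.style_head(h), self.age_head(h), self.gender_head(h)
```

The important part is that every head reads from the same trunk, so gradients from either task update the shared features.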
This architecture expresses a very powerful idea. You say that you want your first two dense layers to learn features that are not only useful for predicting the style of clothes, but also useful for determining a person's age or gender. This is very useful because the second domain contains many more images, so it is hard to overfit on it. In this case, you will learn features that are useful for both worlds, and they won't be able to overfit your small problem before they ever fit it.

And there are features that suit this description perfectly. For example, if you are working with raw image pixels, it makes sense that before trying to determine style, age, or gender, it is important to at least find where the person's head is. For the first problem, determining clothing style, this helps because you will then be able to tell whether the person wears some kind of jewelry, which is indicative of the style. For the second problem, knowing where the face is helps you determine whether the person has facial hair, for example, which is a telltale sign of a man, as far as I remember. This means you will be able to train a lot of low-level features essentially for free, with almost no overfitting, as long as your second domain is large enough in terms of available data.

When training such a network, you have to feed it mini-batches from both problems in an alternating fashion. For example, you could first sample a mini-batch of images for which you know age and gender and train the network through its second head, then sample a second mini-batch containing images for which you know the style and train the first head, as sketched in the code below. There is, of course, more than one way to organize the training procedure. For example, you can take several mini-batches from the second domain for every one from the first, or you can first train your network to convergence on age and gender prediction and only then start fine-tuning it on the first problem. We will study this idea to a much greater extent in the following module dedicated to computer vision.

The general idea is that, regardless of which method you use, you still get this neat property: you can train a huge neural network to learn reasonable features even though the original data is really scarce.
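Here is one possible way to write the alternating loop just described, again only a sketch: train_epoch, style_loader, and demo_loader are made-up names, and nothing forces the one-to-one ratio between the two domains.

```python
import torch.nn.functional as F

def train_epoch(model, optimizer, style_loader, demo_loader):
    """One epoch of alternating updates over the two labeled domains."""
    for (x_style, y_style), (x_demo, y_age, y_gender) in zip(style_loader, demo_loader):
        # Mini-batch from the large age/gender domain: update through the second head.
        optimizer.zero_grad()
        _, age_logits, gender_logits = model(x_demo)
        demo_loss = F.cross_entropy(age_logits, y_age) + F.cross_entropy(gender_logits, y_gender)
        demo_loss.backward()
        optimizer.step()

        # Mini-batch from the small style domain: update through the first head.
        optimizer.zero_grad()
        style_logits, _, _ = model(x_style)
        style_loss = F.cross_entropy(style_logits, y_style)
        style_loss.backward()
        optimizer.step()
```

Both updates flow through the shared trunk, which is exactly why the plentiful age and gender data regularizes the features used for the scarce style labels.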
So, we have just seen a few of these words of power in deep learning. The first one was about managing the level of abstraction: I want features that are much more abstract than the raw image pixels, so that I can use them together with other features. The second one was about managing how much you rely on different features, so you can say: don't trust those guys too much, trust these guys instead. The final one was: could you please train features that are not only useful for my super small problem, but also generally useful for similar problems?

There is, of course, much more to deep learning, and here are some examples of other ideas you can incorporate into your neural network architecture that will appear later in our course. For example, if you want to solve an image classification problem where you want to tell cats from dogs, it makes a lot of sense to make the features you learn invariant to the position of the cat or dog. A cat may appear in the middle, in the top right, or in the bottom left corner, and you want your model to detect it regardless of where it is. There is also a way to teach your neural network to be robust, resilient to small shifts in the data: if a cat slightly moves its paw, it doesn't stop being a cat.

For natural language applications, you'll learn how to teach your neural network to find the underlying cause of the data. Say you have words and you want to classify the sentiment. Instead of working on the level of words, say with a bag-of-words representation, you'll teach your neural network to find the hidden structure, the hidden process that generated those words, basically reverse-engineering the human mind.

There is also a way to train your neural network so that the representations in its intermediate layers have some particular property. For example, you may want your network to be robust in the sense that it doesn't trust any single feature too much. Or you can push the hidden representation to be sparse, so you train your network in such a way that almost all neurons output zeros for any given data object. Of course, there's much more to it.
I have just barely scratched the surface of this idea of deep learning being a language, and as we go further, you'll study much more powerful tools to play with. Now, the key difference between deep learning and other machine learning methods, in my humble opinion, is that while in a random forest you only have a few parameters to tweak, deep learning actually allows you to build networks, to build architectures, in a way that resembles a natural or a programming language. Of course, this language is, as of now, really hard to master. It's hard to tell what kind of architecture or what kind of trick fits a particular problem. And as in any other language, there are a lot of exceptions, so you can't just write down a set of rules and follow them everywhere. Hopefully, our course will help you obtain some of this intuition, though the main source of it is the coding labs, not just listening to lectures. And of course, you'll get much more proficient and resourceful if you actually solve the problems on your own, and in this I wish you luck.