Welcome to week six. This week we'll talk about non-parametric methods, and especially interesting will be the Gaussian process. Let's start by seeing what non-parametric methods are.

We all know what parametric methods are. We define a model that depends on some parameters theta, and then we find optimal values for theta, for example by taking the maximum a posteriori estimate or the maximum likelihood estimate. We can fit a linear model to our data; in this case we'll have two parameters, a and b. If the data becomes more complex, we can add more parameters. For example, we can fit a parabola, and in this case we'll have three parameters. But as the data becomes more and more complex, we need to add more and more parameters. For example, in this case I had to add eight parameters and fit a polynomial of eighth degree.

The other case is non-parametric methods. In non-parametric methods, the number of parameters depends on the dataset size. That is, as the number of data points increases, the decision boundary becomes more and more complex. In parametric methods, though, the number of parameters is fixed, and it doesn't matter how much data we have.

One non-parametric method that you should know is K-nearest neighbors. In this case, we make a prediction at a point x by finding its K nearest neighbors, in this case five, and the prediction is the average of the target values of those neighboring points: for example, 1/5 times the sum of the targets of the five nearest neighbors of the point. The prediction would look like the red curve here, and it isn't smooth.
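Here is a minimal sketch of K-nearest-neighbors regression in Python, just to make the averaging concrete; the code, the NumPy usage, and the toy data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=5):
    """Predict at x_query as the average target of the k nearest training points (1-D inputs)."""
    distances = np.abs(X_train - x_query)   # distance from the query to every training point
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return y_train[nearest].mean()          # (1/k) * sum of their targets

# Toy example: noisy sine data, prediction at x = 3.0
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.1 * rng.normal(size=50)
print(knn_predict(3.0, X_train, y_train, k=5))
```

Predicting this way for every x gives the piecewise-constant, non-smooth curve described above.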
To make it smoother, we could weight the points with a kernel, which gives Nadaraya-Watson regression. In this case, we weight the points by their distance: the points that are close to our point have the highest weight, and the points that are far away have lower weights. This can be written as follows: the prediction at the point x is y(x) = Σ_i w_i(x) y_i, the sum over all points in our dataset of the weight of the point times y_i, the target of the i-th point. The weight is computed from the kernel function K(x, x_i) of x, the point where we want to predict, and x_i, the position of the i-th point. We should also ensure that the weights sum to 1, so we divide the kernel by the sum over all points: w_i(x) = K(x, x_i) / Σ_j K(x, x_j).

We can use different choices for the kernel. The most popular one is the so-called Gaussian kernel. In the Gaussian kernel, we measure the similarity of x1 and x2 as K(x1, x2) = exp(-||x1 - x2||² / (2σ²)), that is, the exponent of minus one over two sigma squared, where sigma is a parameter of the kernel, times the squared distance between the points, and the plots would look like this. If we take a higher value of sigma, the values drop more slowly, so further points get a somewhat higher weight, and if sigma is low, then the kernel quickly drops to zero. We could also use a uniform kernel: with the uniform kernel, we equally weight all the points that are at distance at most h.
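A minimal Nadaraya-Watson sketch with a Gaussian kernel might look like this; as before, the code and the names in it are illustrative assumptions rather than material from the lecture.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """K(x1, x2) = exp(-(x1 - x2)^2 / (2 * sigma^2))."""
    return np.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def nadaraya_watson_predict(x_query, X_train, y_train, sigma=1.0):
    """y(x) = sum_i w_i(x) * y_i, with the weights normalized to sum to 1."""
    k_values = gaussian_kernel(x_query, X_train, sigma)  # kernel value for every training point
    weights = k_values / k_values.sum()                  # normalize so the weights sum to 1
    return np.sum(weights * y_train)

# Toy example: same noisy sine data as in the k-NN sketch
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.1 * rng.normal(size=50)
print(nadaraya_watson_predict(3.0, X_train, y_train, sigma=0.5))
```

With a larger sigma the prediction is smoother, because distant points still receive noticeable weight; with a small sigma the prediction follows the nearby targets more closely.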
So, what have we seen? There are parametric methods and non-parametric methods. In parametric methods, you have some fixed number of parameters, so the complexity of the model is limited. In non-parametric methods, however, the decision boundary becomes more and more complex as the number of data points increases; roughly speaking, the model can become arbitrarily complex as the number of data points goes to infinity.

For parametric methods, inference is quite fast. For example, for linear regression, once you have fitted the weights, the prediction is just the scalar product between the new point and the weight vector. For non-parametric methods, though, you have to process all the data points to make a prediction; for example, in the Nadaraya-Watson algorithm you have to compute the weights for all the data points.

In parametric methods, the training is sometimes slow. For example, training a neural network may take you hours or even days, while for non-parametric methods the training is usually just remembering all the data. This is not entirely true, since for K-nearest neighbors, for example, we can pre-compute some information to make prediction a bit faster. In most cases, however, you simply have to remember all the data.

So, we've seen what non-parametric methods are, and the method that we're particularly interested in is the Gaussian process. Gaussian processes are regression models that are able to estimate the uncertainty of their predictions, which is a very desirable property for applications like medicine. We'll see in the next video what Gaussian processes are.