1 00:00:00,045 --> 00:00:04,491 [MUSIC] 2 00:00:04,491 --> 00:00:09,525 >> Let's now explore in more depth the learning task of learning a decision tree. 3 00:00:09,525 --> 00:00:12,708 So just like before, we're given a set of training data, and 4 00:00:12,708 --> 00:00:14,830 we extract some features from that. 5 00:00:14,830 --> 00:00:18,950 And then we're going to take this decision tree model, T of x, and 6 00:00:18,950 --> 00:00:21,630 use it to predict y hat. 7 00:00:21,630 --> 00:00:24,300 So let's explore the T of x in a little bit more detail and 8 00:00:24,300 --> 00:00:27,420 discuss what the learning process might look like here. 9 00:00:27,420 --> 00:00:31,270 In particular, we're given a data set here, which is a table where each row 10 00:00:31,270 --> 00:00:35,393 corresponds to a data point, and we have our features here, 11 00:00:35,393 --> 00:00:40,810 h1 of x, h2 of x, h3 of x. 12 00:00:40,810 --> 00:00:43,980 So credit, term, and income, so in this example there are only going to be three 13 00:00:43,980 --> 00:00:49,689 features, and then we try to predict what we call the loan status. 14 00:00:50,770 --> 00:00:55,570 Which is this variable y, which is whether the loan is risky or safe. 15 00:00:55,570 --> 00:00:59,430 And our goal is to learn the decision tree that can help us make that prediction, 16 00:00:59,430 --> 00:01:02,870 given one or more of these input features. 17 00:01:04,290 --> 00:01:09,830 More generally, we're going to be given N observations from our data of the form (xi, 18 00:01:09,830 --> 00:01:12,150 yi), where yi is the true label. 19 00:01:12,150 --> 00:01:16,390 And we're trying to optimize some kind of a quality metric, fitting 20 00:01:16,390 --> 00:01:20,880 a tree that is as good as possible with respect to that metric. 21 00:01:20,880 --> 00:01:23,610 And in the case of decision trees, the metric we're going to use 22 00:01:23,610 --> 00:01:26,220 is just straight-up classification error. 23 00:01:26,220 --> 00:01:29,110 There are other metrics that you may use for decision trees, but just for 24 00:01:29,110 --> 00:01:32,590 this module we want to focus on classification error. 25 00:01:32,590 --> 00:01:35,960 And classification error measures the fraction of mistakes that we make 26 00:01:35,960 --> 00:01:38,030 when we try to make predictions on the training data. 27 00:01:38,030 --> 00:01:41,460 So that's the number of incorrect predictions divided by the total number of 28 00:01:41,460 --> 00:01:43,150 examples, just like we saw 29 00:01:43,150 --> 00:01:45,490 in the logistic regression case. 30 00:01:45,490 --> 00:01:47,940 The best possible value here is 0, no mistakes, 31 00:01:47,940 --> 00:01:49,450 and the worst possible value here is 1: 32 00:01:49,450 --> 00:01:53,200 I made as many mistakes as there are data points. 33 00:01:53,200 --> 00:01:56,850 So hopefully you don't have error 1; hopefully you have error closer to 0. 34 00:01:56,850 --> 00:01:59,747 So now that we've seen classification error, 35 00:01:59,747 --> 00:02:02,970 we can state more clearly what our learning goal is. 36 00:02:02,970 --> 00:02:07,140 Given the dataset, find the tree that minimizes the classification error. 37 00:02:07,140 --> 00:02:09,830 The question is, how hard is this problem? 38 00:02:09,830 --> 00:02:12,220 How hard is this learning task? 39 00:02:12,220 --> 00:02:14,842 And it turns out that this is an extremely hard learning task.
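To make the classification error metric above concrete, here is a minimal Python sketch; the function and variable names are illustrative and not from the lecture's own materials.

def classification_error(y_true, y_pred):
    # Fraction of mistakes: number of incorrect predictions
    # divided by the total number of examples.
    mistakes = sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp)
    return mistakes / len(y_true)

# Example: 1 wrong prediction out of 4 gives an error of 0.25.
print(classification_error(['safe', 'risky', 'safe', 'safe'],
                           ['safe', 'risky', 'risky', 'safe']))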
40 00:02:14,842 --> 00:02:18,520 There are exponentially many combinations of decision trees out there. 41 00:02:18,520 --> 00:02:21,500 Even if you consider just one branch of the decision tree, 42 00:02:21,500 --> 00:02:26,180 selecting which features come next, and in what order, can be a really hard problem and 43 00:02:26,180 --> 00:02:28,500 lead to an explosion in the number of combinations. 44 00:02:28,500 --> 00:02:29,020 And in fact, for 45 00:02:29,020 --> 00:02:33,440 those familiar with this, this is an example of what's called an NP-hard problem. 46 00:02:33,440 --> 00:02:37,840 So it's a really hard problem, and there are no approximation algorithms for 47 00:02:37,840 --> 00:02:40,820 this problem with guarantees, but there are some simple heuristic ideas 48 00:02:40,820 --> 00:02:42,808 that tend to perform extremely well in practice. 49 00:02:42,808 --> 00:02:45,705 And we're going to talk about one such idea today, 50 00:02:45,705 --> 00:02:49,380 which is a simple greedy algorithm, basically. 51 00:02:49,380 --> 00:02:53,023 And we're going to incrementally build a tree, one layer at a time, 52 00:02:53,023 --> 00:02:56,809 in order to get the best possible classification error at each step. 53 00:02:56,809 --> 00:02:57,309 >> [MUSIC]
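As a rough illustration of the greedy step described above, here is a short Python sketch that picks the single feature whose split gives the lowest classification error, predicting the majority label in each branch; the helper names and the toy data layout are assumptions for illustration, not the course's actual implementation.

from collections import Counter

def majority_label(labels):
    # Predict the most common label inside a node.
    return Counter(labels).most_common(1)[0][0]

def split_error(data, labels, feature):
    # Classification error if we split on `feature` and predict
    # the majority label inside each resulting branch.
    mistakes = 0
    for value in set(row[feature] for row in data):
        branch = [y for row, y in zip(data, labels) if row[feature] == value]
        mistakes += sum(1 for y in branch if y != majority_label(branch))
    return mistakes / len(labels)

def best_feature(data, labels, features):
    # Greedy step: choose the feature whose split has the lowest error.
    return min(features, key=lambda f: split_error(data, labels, f))

# Hypothetical toy loan data with the three features from the lecture.
data = [{'credit': 'excellent', 'term': '3yr', 'income': 'high'},
        {'credit': 'poor', 'term': '5yr', 'income': 'low'},
        {'credit': 'fair', 'term': '3yr', 'income': 'high'}]
labels = ['safe', 'risky', 'safe']
print(best_feature(data, labels, ['credit', 'term', 'income']))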