Monte Carlo Tree Search is a more sophisticated variation of Monte Carlo Search that tackles some of the weaknesses of the simpler method.

Both methods build up a game tree incrementally, and both rely on random simulation of games. But they differ in the way the tree is expanded.

MCS, Monte Carlo Search, uniformly expands the partial game tree during its expansion phase, and then simulates games starting at the states on the fringe of the expanded tree. MCTS, Monte Carlo Tree Search, uses a more sophisticated approach, in which the processes of expansion and simulation are interleaved.

MCTS processes the game tree in cycles of four steps each. After each cycle completes, it repeats the steps so long as there's time remaining, at which point it selects an action based on the statistics it has accumulated to that point.

In the selection step, the player traverses the tree produced thus far to select an unexpanded node of the tree, making choices based on the visit counts and utilities stored on the nodes of the tree. We'll see how that happens a little bit later.

During expansion, the successors of the state chosen during the selection phase are added to the tree.

The player then simulates the game starting at the node chosen during the selection phase. In so doing, it chooses actions at random until a terminal state is encountered, as with MCS.

And finally, the value of the terminal state is propagated back along the path to the root node, and the visit counts and utilities are updated accordingly.

Here's an implementation of the MCTS selection procedure. If the initial state has not been seen before, that is, it has zero visits, then it is selected. Otherwise, the procedure searches the successors of the node. If any of them have not been seen, then one of the unseen nodes is selected. If all of the successors have been seen before, then the procedure uses the selectfn subroutine, which we'll talk about shortly, to find values for those nodes, and it chooses the one that maximizes this value.

Okay, one of the most common ways of implementing selectfn is what's called UCT, which is short for Upper Confidence bounds applied to Trees. A typical UCT formula is shown here: vi plus the square root of log np over ni. vi is the average reward for that state seen so far. np is the total number of times the state's parent was picked. ni is the number of times this particular state was picked.

Of course, there are other ways that one can evaluate states. The formula here is based on a combination of what's called exploitation and exploration. Exploitation here means the use of results on previously explored states, which is the first term, vi. Exploration means the expansion of as-yet unexplored states, a measure of which is the second term.

And at the bottom of the slide, we have a simple implementation of the formula shown at the top.
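The slide's code isn't reproduced in this transcript, so here's a minimal Python sketch of the selection procedure just described. The Node class and its fields (state, parent, visits, utility, children) are illustrative assumptions, not the course's actual data structures.

```python
# Hypothetical node representation for illustration: a visit count,
# an accumulated utility, and links to parent and children.
class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.visits = 0
        self.utility = 0.0
        self.children = []

def select(node):
    # A node with zero visits has not been seen before: select it.
    # (Also stop at a visited node whose successors aren't expanded yet.)
    if node.visits == 0 or not node.children:
        return node
    # Otherwise search the successors; pick any unseen one.
    for child in node.children:
        if child.visits == 0:
            return child
    # All successors have been seen: descend through the child that
    # maximizes selectfn (sketched below).
    return select(max(node.children, key=selectfn))
```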
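Likewise, here's one way selectfn might implement the UCT formula above, under the same assumptions:

```python
import math

def selectfn(node):
    # Exploitation: the average reward observed through this node so far (vi).
    vi = node.utility / node.visits
    # Exploration: log of the parent's visit count (np) over this node's
    # visit count (ni), which favors rarely tried successors.
    return vi + math.sqrt(math.log(node.parent.visits) / node.visits)
```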
Expansion in MCTS is basically the same as that for MCS. An implementation for a single player is shown here. In large games with large time bounds, it's possible that the space consumed by this process could exceed the memory available to a player. In such cases, it's common to use a variation of the selection procedure in which no additional states are added to the tree, and just probes are used.

Simulation for MCTS is essentially the same as simulation for MCS, so the same exact procedure can be used in both methods.

MCTS, however, has a different procedure for recording the results, called backpropagation.

At the selected node, the method records a visit count and a utility. The visit count in this case is one, since it's a newly processed state. The utility is the result of the simulation.

The procedure then propagates to the ancestors of this node. In the case of a single-player game, the procedure simply adds one to the visit count of each ancestor, and it augments the ancestor's total utility by the utility obtained on the latest simulation. In the case of a multi-player game, the propagated value is the minimum of the values for all opponent actions.
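To make the backpropagation step concrete, here's a hedged single-player sketch using the hypothetical Node class from earlier; score stands for the value of the simulated terminal state.

```python
def backpropagate(node, score):
    # Record one visit and the simulation's utility at the selected node,
    # then repeat at each ancestor along the path back to the root.
    while node is not None:
        node.visits += 1
        node.utility += score
        node = node.parent
```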
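And finally, one way the four steps might be tied together for a single-player game. The game-interface helpers legal_moves, next_state, is_terminal, and reward are assumptions for illustration, as is the time-based cutoff; terminal-state edge cases are omitted for brevity.

```python
import random
import time

def expand(node):
    # Expansion: add the successors of the selected node to the tree
    # (single-player case).
    node.children = [Node(next_state(node.state, move), parent=node)
                     for move in legal_moves(node.state)]

def simulate(state):
    # Simulation: choose actions at random until a terminal state is
    # encountered, exactly as in plain Monte Carlo Search.
    while not is_terminal(state):
        state = next_state(state, random.choice(legal_moves(state)))
    return reward(state)

def mcts(root, time_limit):
    deadline = time.time() + time_limit
    while time.time() < deadline:      # repeat cycles while time remains
        node = select(root)            # 1. selection
        expand(node)                   # 2. expansion
        score = simulate(node.state)   # 3. simulation
        backpropagate(node, score)     # 4. backpropagation
    # Time is up: act on the accumulated statistics, here by taking the
    # child of the root with the best average utility.
    return max(root.children,
               key=lambda c: c.utility / c.visits if c.visits else 0.0)
```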