1 00:00:04,870 --> 00:00:05,715 Our first approach 2 00:00:05,715 --> 00:00:06,500 [SOUND] 3 00:00:06,500 --> 00:00:09,348 to building a player for small single 4 00:00:09,348 --> 00:00:09,932 [SOUND] 5 00:00:09,932 --> 00:00:12,680 player games is called compulsive deliberation. 6 00:00:14,040 --> 00:00:16,680 It's a big name for a relatively simple concept. 7 00:00:18,180 --> 00:00:20,880 In compulsive deliberation, on each step, the 8 00:00:20,880 --> 00:00:24,290 player examines the then-current game tree 9 00:00:24,290 --> 00:00:27,520 to determine his best move for that step, and then makes the move. 10 00:00:29,110 --> 00:00:32,329 It repeats this process on the next step, and so forth, until the end of the game. 11 00:00:34,620 --> 00:00:36,790 Now in pure compulsive deliberation, each step 12 00:00:36,790 --> 00:00:39,100 of the computation is independent of every other step. 13 00:00:39,100 --> 00:00:43,850 No data computed during one step is accessible in subsequent steps. 14 00:00:43,850 --> 00:00:46,280 The player treats each step as if it were a brand-new game. 15 00:00:47,710 --> 00:00:50,440 Now this is obviously wasteful, but it doesn't really hurt 16 00:00:50,440 --> 00:00:54,490 so long as there's enough time to do the repeated calculations. 17 00:00:55,970 --> 00:00:57,780 We start with this method because it's simple 18 00:00:57,780 --> 00:00:59,670 to understand, and at the same time it 19 00:00:59,670 --> 00:01:01,110 serves as a template for the more 20 00:01:01,110 --> 00:01:04,900 sophisticated and much less wasteful methods to come. 21 00:01:07,888 --> 00:01:10,570 Using the basic subroutines provided in the GGP starter 22 00:01:10,570 --> 00:01:15,090 pack, building a compulsive deliberation player is not difficult. 23 00:01:15,090 --> 00:01:17,670 The implementation looks like this. 24 00:01:17,670 --> 00:01:18,890 As shown, it's almost identical to 25 00:01:18,890 --> 00:01:20,810 our implementation of legal and random players.
26 00:01:21,850 --> 00:01:24,500 The only difference lies in the play handler. 27 00:01:24,500 --> 00:01:28,020 In selecting an action, a legal player uses findlegalx 28 00:01:28,020 --> 00:01:31,120 to compute a legal move for a given state. 29 00:01:31,120 --> 00:01:33,400 In compulsive deliberation, 30 00:01:33,400 --> 00:01:35,870 the play handler instead uses a subroutine 31 00:01:35,870 --> 00:01:39,060 called bestmove that does a more sophisticated computation. 32 00:01:41,680 --> 00:01:44,690 Before looking at bestmove, let's look 33 00:01:44,690 --> 00:01:47,250 at a slightly simpler subroutine called maxscore. 34 00:01:48,395 --> 00:01:52,110 Maxscore takes a state as argument and returns the best score that 35 00:01:52,110 --> 00:01:56,640 the player can obtain by any sequence of actions in the specified state. 36 00:01:56,640 --> 00:01:57,390 Let's see how it works. 37 00:02:00,120 --> 00:02:03,490 As its first step, the procedure checks whether the given state is terminal. 38 00:02:04,770 --> 00:02:09,769 If so, then the best possible score is the reward for the specified state. 39 00:02:11,260 --> 00:02:15,600 It computes this by calling the findreward subroutine on 40 00:02:15,600 --> 00:02:18,990 the role, the state, and the rule set. 41 00:02:22,490 --> 00:02:26,670 If the state is not terminal, then it tries each of the actions legal in that state. 42 00:02:28,000 --> 00:02:30,830 It computes the maximum score for the state that results from 43 00:02:30,830 --> 00:02:35,140 executing that action, and returns the best score it finds. 44 00:02:36,400 --> 00:02:39,130 The first step in doing this is to compute a list of all legal actions. 45 00:02:41,360 --> 00:02:47,870 It initializes the variable score to zero, then loops over the possible actions. 46 00:02:49,750 --> 00:02:52,365 Since there can generally be multiple players, findnext takes 47 00:02:52,365 --> 00:02:56,020 as arguments a list of actions of all players.
48 00:02:56,020 --> 00:02:57,760 In this case we have a single-player game, 49 00:02:57,760 --> 00:02:59,860 so the player creates a list of just one element. 50 00:03:03,130 --> 00:03:04,990 The player then uses findnext to compute the 51 00:03:04,990 --> 00:03:07,740 next state resulting from this move in the current state. 52 00:03:11,010 --> 00:03:13,330 It then finds the maximum score for that successor state. 53 00:03:13,330 --> 00:03:13,830 If 54 00:03:16,610 --> 00:03:18,810 the result is 100, then it simply returns that value, 55 00:03:18,810 --> 00:03:21,240 as there's no way to get more than 100 points. 56 00:03:21,240 --> 00:03:23,280 Otherwise, if the result is greater than the current 57 00:03:23,280 --> 00:03:25,360 score, it updates the score and goes on. 58 00:03:26,580 --> 00:03:30,048 And finally, if it has not encountered a 100 value in the process, it 59 00:03:30,048 --> 00:03:32,547 returns the current score, which, by construction, is 60 00:03:32,547 --> 00:03:34,790 the maximum score over all possible actions. 61 00:03:38,250 --> 00:03:41,330 Okay, now that we have maxscore, let's return to bestmove. 62 00:03:42,580 --> 00:03:46,520 The definition uses maxscore, and is actually quite similar to maxscore. 63 00:03:48,690 --> 00:03:50,160 There are just a few differences. 64 00:03:50,160 --> 00:03:52,110 First of all, bestmove is not itself 65 00:03:52,110 --> 00:03:53,919 recursive, though it calls maxscore, which is recursive. 66 00:03:56,510 --> 00:03:59,640 Bestmove does not need to check whether the state is terminal, because it 67 00:03:59,640 --> 00:04:02,800 would not be called if the game were in a terminal state. 68 00:04:02,800 --> 00:04:03,690 That's already been checked. 69 00:04:07,590 --> 00:04:11,710 It initializes a variable called action to the first legal action.
70 00:04:11,710 --> 00:04:13,200 It then behaves pretty much like 71 00:04:13,200 --> 00:04:15,170 maxscore, trying each possible action to see if 72 00:04:15,170 --> 00:04:19,820 it can find one with a higher score than any previous action it has seen. 73 00:04:19,820 --> 00:04:21,670 If it ever encounters a maxscore 74 00:04:21,670 --> 00:04:24,040 of 100, it simply returns the corresponding action. 75 00:04:24,040 --> 00:04:27,250 Otherwise, it proceeds until it has tried all actions, at which point 76 00:04:27,250 --> 00:04:30,010 it returns the action with the highest maxscore that it has seen.
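
The two procedures described in this lecture can be sketched roughly as follows. This is a minimal Python illustration, not the starter pack's own code: the toy game (add 3 or 5 to a running total for a fixed number of moves) and the helpers `find_terminal`, `find_reward`, `find_legals`, and `find_next` are hypothetical stand-ins for the starter pack subroutines, and the role and rule set arguments are omitted because the toy game has a single player and fixed rules.

```python
from typing import List, Tuple

# Toy single-player game (an assumption for illustration, not from the lecture):
# a state is (running total, moves left); each move adds 3 or 5 to the total.
State = Tuple[int, int]

def find_terminal(state: State) -> bool:
    """Hypothetical stand-in: the game ends when no moves are left."""
    return state[1] == 0

def find_reward(state: State) -> int:
    """Hypothetical stand-in: 100 for a total of 15, 50 for 13, else 0."""
    total = state[0]
    if total == 15:
        return 100
    if total == 13:
        return 50
    return 0

def find_legals(state: State) -> List[str]:
    """Hypothetical stand-in: both actions are legal in every state."""
    return ["add3", "add5"]

def find_next(actions: List[str], state: State) -> State:
    """Takes a list of actions, one per player; here the list has one element."""
    total, moves_left = state
    step = 3 if actions[0] == "add3" else 5
    return (total + step, moves_left - 1)

def maxscore(state: State) -> int:
    """Best score obtainable from this state by any sequence of actions."""
    if find_terminal(state):
        return find_reward(state)
    score = 0
    for action in find_legals(state):
        result = maxscore(find_next([action], state))
        if result == 100:
            return 100  # cannot do better than 100, so stop early
        if result > score:
            score = result
    return score

def bestmove(state: State) -> str:
    """Action leading to the highest maxscore; assumes state is not terminal."""
    actions = find_legals(state)
    action = actions[0]  # start with the first legal action
    score = 0
    for candidate in actions:
        result = maxscore(find_next([candidate], state))
        if result == 100:
            return candidate
        if result > score:
            score = result
            action = candidate
    return action

def play(state: State) -> Tuple[List[str], int]:
    """Compulsive deliberation: recompute bestmove from scratch on every step."""
    moves = []
    while not find_terminal(state):
        move = bestmove(state)
        moves.append(move)
        state = find_next([move], state)
    return moves, find_reward(state)
```

Calling `play((0, 3))` on this toy game searches the whole remaining tree before each of the three moves, sharing nothing between steps, which is the wasteful but simple behavior the lecture describes.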