So, course structure. There are going to be recommended readings. There are going to be in-video, or in-lecture, questions that will pop up. There are going to be several problem sets during the term, and these are going to be very useful for review and exam preparation. So I'll give you a hint right now: if you do the problem sets, if you actually master the problem sets, the exams are going to be relatively easy after that. We'll probably use peer evaluation for grading the problem sets, because a lot of them are more open-ended problems. And we are going to have a midterm and a final exam.

One other thing I wanted to point out is that collaboration in this class is encouraged, but I want everyone to do their own problem sets, midterm, and final exam. So you can discuss the overall generalities of the ideas and the concepts, but I don't want people discussing the actual exam questions in particular. For instance, if you have some caching question and you want to understand how caches really work, discuss the concepts and collaborate on that, but don't discuss and collaborate on the actual problem itself on the respective problem sets, midterms, and final exams.

Okay, so let's talk about the content of this course. We have a very high-level motivation, and now we're going to talk about what's inside of this course. I'm going to start off by contrasting it with what you should have already learned. In a computer organization class, something like ELE 375 at Princeton, you will have learned how to build a basic processor, something like we see here. This is actually the RISC-I processor from Berkeley. Depending on who you ask, either the RISC-I or the first MIPS chip was the first academic RISC; the IBM 801 probably used a lot of those ideas before then, but didn't call it RISC. You learned how to design something that had about 50,000 transistors. This entire design here is a two-stage pipelined processor. The things you should have learned are basic cache ideas, pipelining (how you pipeline a processor), a little bit about memory systems, and roughly how digital logic works.
Then in this class, to contrast, instead of learning how to build a very simplistic processor, we're going to learn how to build cutting-edge, modern-day microprocessors. That's right, we're going to learn how to build things like this, or at least design things like this. This is a Core i7 from Intel. I guess this is an original Core i7; we're now in the third generation of Core i7s, standing here in 2012, so this is pretty recent. And to give you an idea, in contrast to that previous picture, which was 50,000 transistors, this design is about 700 million transistors. So the complexity has gone up a lot. That other processor, or the processors that you learned about in your computer organization class, would be a tiny little box up here, and would have performance roughly equivalent to the size of that tiny box relative to this big processor. So instead of just building little tiny processors, or toy processors, we're going to learn how to build big, high-performance processors.

Before I go down this list, I want to talk briefly about the course content of ELE 475 and the two main techniques to make processors go fast. So how do we go about making processors go fast? Because people like their computing systems to run fast. Well, one is to exploit parallelism. We're going to figure out how to exploit lots of concurrent transistors, or concurrent parallelism in your program, and as you add more transistors or more parallelism, hopefully your computing system goes faster. There are different techniques for going after parallelism, and they're not all explicit parallelism; a lot of them are implicit parallelism. For instance, instruction-level parallelism is a completely implicit concept: the programmer doesn't have to do anything. And then the other main technique we can think about is just to do less work. If you look at, say, an assembly line of someone building cars, you can either pipeline your assembly system and try to get pipeline parallelism, or you can try to have multiple people building different cars at the same time. This all falls in the parallelism category.
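To make that implicit, instruction-level kind of parallelism concrete, here is a minimal sketch (not from the lecture; the function and variable names are invented for illustration). The three multiplies have no data dependences on one another, so a superscalar processor can issue them in the same cycle without any help from the programmer, while the final add has to wait for its inputs.

```c
/* Hypothetical example of implicit instruction-level parallelism:
   ordinary sequential C code in which the hardware can find overlap. */
int ilp_example(int a, int b, int c, int d, int e, int f) {
    int x = a * b;      /* independent of y and z                */
    int y = c * d;      /* independent of x and z                */
    int z = e * f;      /* independent of x and y                */
    return x + y + z;   /* must wait for x, y, and z to complete */
}
```

The source is plain sequential code; discovering and exploiting this overlap is entirely the job of the hardware (or the compiler, in the VLIW case discussed later).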
There's something else you can do if you want to make a car faster: you just take out steps, or you take out components. So you do less work. And one way to do less work is to have fancier software systems. We can have better compilers and runtime systems, and a lot of the time they can remove work. This is like the optimization passes in your compiler: if you turn on -O3, the optimization flag for GCC, it's going to try to remove instructions from your program which are either redundant or not doing any useful work. Another great example of this, which people don't really think about as doing less work, but actually is, is something like a cache in your microprocessor. A cache puts memory closer to the processor than main memory. This is equivalent to an assembly system, a production line of cars, where for every part you had to walk down the street three blocks, get the part, and bring it back. Well, that's pretty slow; it's a lot of work for each part that you need to go fetch. With a cache, you can actually put the data very close. The similar idea in car assembly is to put a bin, if you will, of all the parts you need to build the car right next to you and just grab out of that bin. You're going to do less work; you'll do less walking. It's a similar sort of thing with caches.
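As a hedged sketch of that "bin of parts" idea in code (not from the lecture; the matrix size and function names are assumptions), both functions below compute the same sum, but the row-major version walks memory sequentially so most accesses hit in the cache, while the column-major version strides through memory and keeps going back to main memory, doing far more work per element.

```c
#include <stddef.h>

#define N 1024   /* assumed matrix dimension, chosen arbitrarily */

/* Cache-friendly: consecutive addresses, like grabbing parts from a bin. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-hostile: each access jumps N ints ahead, like walking three
   blocks for every part, so many accesses miss and go to main memory. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Same result, same number of additions; the only difference is how much walking to memory the processor has to do.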
So, these are the two primary techniques that we're going to apply. Now let's dive into the actual technical content of what we're going to learn in this Computer Architecture class, and we'll categorize it as either doing less work or parallelism. The first thing we're going to start off talking about in this class is instruction-level parallelism. We're going to look at superscalar processors, which can execute multiple instructions at the same time, and do it implicitly from sequential code. We're also going to study very long instruction word processors, or what are called VLIW processors. We're going to hint a little bit at pipeline parallelism and look at how to build longish pipelined processors. We'll talk about advanced memory and cache systems. Now, this has no "parallelism" in the title; what this is going to be is looking at doing less work. We're going to look at how you build memory systems that either bring the data closer or have higher bandwidth, and at a lot of the implementation issues in building these advanced memory systems. Then, as the term goes on, we're going to be talking about data-level parallelism. This is a more explicit level of parallelism; these are things like vector computers and graphics processing units, or general-purpose graphics processing units, GPGPUs. And at the end of the course, we're going to talk about explicit threaded parallelism. We'll be talking about multithreading, how you build multiprocessor systems (so multiple-chip multiprocessor systems, multicore and manycore systems), and how you interconnect all these different processors. Roughly, the first third of the course is going to talk about instruction-level parallelism. There's going to be a middle third, which is going to talk about caches and a little about data-level parallelism, and then the last third is going to talk about more threaded levels of parallelism. But that's a very coarse cut of this