DSPRelated.com
Forums

1 instruction / nanosecond plus a nanosecond for on-chip cache access,

Started by anon July 6, 2006
<cpu16x1832@wmconnect.com> wrote in message 
news:1152460052.205210.205600@s13g2000cwa.googlegroups.com...

> Pipelining DOESN'T reduce execution time of individual instructions, it
> statistically increases it!
No, pipelining reduces the average execution time of instructions by reducing the number of logic levels necessary in each stage. This results in a higher maximum clock frequency, and thus the execution time of instructions is reduced. A drawback is that the average number of cycles per instruction increases as the pipeline becomes deeper.
> Pipelining DOES increase THE RATE at which INSTRUCTION STREAMS complete
> execution, however, NO SINGLE INSTRUCTION RUNS FASTER!
If you mean the total time an instruction spends in the pipeline, then yes, that is true. But that is not the same as the execution time which is only a tiny proportion of the total.
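The two quantities being argued over here, time spent in the pipeline versus the interval between completed instructions, can be put in numbers. What follows is a minimal sketch, assuming an idealized pipeline with equal stage delays, a fixed latch overhead per stage, and no stalls. The 5 ns logic delay, the 0.1 ns overhead, and the function names are illustrative assumptions, not figures for any real CPU.

```python
def unpipelined(logic_ns):
    """One instruction at a time: latency and completion interval are equal."""
    return {"latency_ns": logic_ns, "interval_ns": logic_ns}


def pipelined(logic_ns, stages, latch_ns):
    """Ideal pipeline: logic is split evenly and each stage adds latch overhead."""
    cycle_ns = logic_ns / stages + latch_ns
    return {"latency_ns": cycle_ns * stages, "interval_ns": cycle_ns}


flat = unpipelined(5.0)
deep = pipelined(5.0, stages=5, latch_ns=0.1)

print("unpipelined:", flat)
print("5 stages:   ", deep)
```

Under these assumptions the time a single instruction spends in flight goes up slightly, while the interval between completed instructions drops by almost the number of stages.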
> My example urls, at the top of my request for information post,
> reference parallel processors and may execute DOZENS of INSTRUCTION
> STREAMS SIMULTANEOUSLY!!!
And why is that good? Parallel computers have existed for many decades, and chips with multiple cores are not exactly new either. The problem is that apart from a few exceptions it is extremely difficult to program them and get anywhere near peak performance. This is why people look for the fastest CPUs available first and only then look at using them in parallel.
> DOZENS of GHz (in parallel) is SIGNIFICANTLY GREATER than 1 GHz stuck
> inside an ancient pipeline architecture.
No it's not. If you're talking about SeaForth then you need to realise it is a very simplistic instruction set and likely an even more simplistic micro architecture. It will need many many more instructions and cycles to do the same amount of work as existing architectures. So even if it did reach GHz speeds (which it can't if it isn't pipelined), it wouldn't get anywhere near the performance of those "ancient" pipelined CPUs. Clock frequency does NOT equal performance...

Wilco
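The point that clock frequency is not performance follows from the usual relation: run time = instruction count × cycles per instruction ÷ clock rate. A small sketch, with entirely hypothetical workload numbers chosen only to show the shape of the trade-off:

```python
def run_time_ns(instructions, cycles_per_instruction, clock_ghz):
    """Run time in ns = instruction count * CPI / clock rate in GHz."""
    return instructions * cycles_per_instruction / clock_ghz


# Hypothetical: machine B has twice the clock and a better CPI than machine A,
# but its simpler instruction set needs five times as many instructions.
machine_a = run_time_ns(1_000_000, cycles_per_instruction=1.5, clock_ghz=0.5)
machine_b = run_time_ns(5_000_000, cycles_per_instruction=1.0, clock_ghz=1.0)

print(f"A: {machine_a / 1e6:.1f} ms, B: {machine_b / 1e6:.1f} ms")
```

With these made-up inputs the slower-clocked machine finishes first. Whether a real stack machine needs more or fewer instructions for a given job is exactly what the posters disagree on.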
Wilco Dijkstra wrote:
> No it's not. If you're talking about SeaForth then you need to realise
> it is a very simplistic instruction set and likely an even more simplistic
> micro architecture. It will need many many more instructions and cycles
> to do the same amount of work as existing architectures.
It is the old RISC idea, except instead of making 8x as much code with big RISC opcodes compared to CISC opcodes we get significantly less because of the tiny zero-operand opcodes that run so much faster than memory without pipelines. For some things we need fewer instructions than other processors and for other things we need more, but our opcodes are so tiny that our code is often among the smallest. We know what code works.
> So even if it did reach GHz speeds (which it can't if it isn't pipelined),
Pipelining is a form of parallelism, but staggered. It takes a lot of transistors and gets little return. Best case gives several times speedup and worst case is many times slowdown. Interrupts, for instance, are notorious for causing catastrophic pipeline stalls and cache misses. It makes chips hot and expensive and harder to program, and programs get bigger, and it reduces production yield.

The point here is getting up to 1 GHz instruction execution at the minimal cost in transistors by not using pipelining or caches, making a class of programs smaller, faster, and simpler, as odd as that might sound. The idea is that you pay for transistors. You can use many millions of them to execute a billion pipelined instructions per second on one processor with parallel pipelines, or you can have a large number of non-pipelined processors, perhaps thousands of them, running at that same speed.
> it wouldn't get anywhere near the performance of those "ancient"
> pipelined CPUs. Clock frequency does NOT equal performance...
Clock frequency does not equal performance. But I have seen a lot more profiled code on these things than most people. And we make decisions based on what real code can do, not just guesses. There are things where we can't compete with big desktop CPUs and don't intend to. We can also use processors to respond to events much faster than is possible with any interrupt driven code. Sometimes thousands of times or millions of times faster than big pipelined cached interrupted processors can respond to external events. We don't expect to see any big expensive pipelined designs compete in what we will do either.
In comp.arch Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote:

[...]


I admit I thought this thread was about 1 instruction / nanosecond (as labelled)
and not 1 GIPS/$. :-}


> > My example urls, at the top of my request for information post,
> > reference parallel processors and may execute DOZENS of INSTRUCTION
> > STREAMS SIMULTANEOUSLY!!!
> And why is that good? Parallel computers have existed for many
> decades, and chips with multiple cores are not exactly new either.
> The problem is that apart from a few exceptions it is extremely
> difficult to program them and get anywhere near peak performance.
> This is why people look for the fastest CPUs available first and
> only then look at using them in parallel.
... Maybe this drags things back to the reputed thread subject ;-) . Instead of getting the highest performance at any price, the idea behind cluster design is to get the best performance per $ and scale up to your budget.

For some years I've been running a small business selling computer time on essentially home-built clusters. Obviously (?) I don't make my own chips or even have the patience to play with FPGAs. So getting to $US1 per GIPS (although in my neck of the woods we'd be talking about FLOPS) is a few years off, yet. But maximizing bang per buck is certainly in the $US10-$100 per FLOPS range when building 100s of crunch nodes at a time. I.e. old 1 GHz chips, old mobos, old SDRAMs, and well-designed s/w using SSE{1,2}. The "exact opposite" (modulo engineering mentality) of going high-performance/bleeding edge.
"russell kym horsell" <kym@ukato.freeshell.org> wrote in message 
news:e8sj7g$2rt$1@chessie.cirr.com...
> In comp.arch Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote:
>> > My example urls, at the top of my request for information post,
>> > reference parallel processors and may execute DOZENS of INSTRUCTION
>> > STREAMS SIMULTANEOUSLY!!!
>> And why is that good? Parallel computers have existed for many
>> decades, and chips with multiple cores are not exactly new either.
>> The problem is that apart from a few exceptions it is extremely
>> difficult to program them and get anywhere near peak performance.
>> This is why people look for the fastest CPUs available first and
>> only then look at using them in parallel.
> ...
>
> Maybe this drags things back to the reputed thread subject ;-) .
> Instead of getting the highest performance at any price, the idea
> behind cluster design is to get the best performance per $ and scale
> up to your budget.
Yes, it all depends on your goals and how much money you're willing to spend. You do indeed get the best performance per dollar by going for the low end of desktop CPUs rather than paying 4 times as much for only 20% more performance. That is very different from what the OP proposed though.
> For some years I've been running a small business selling computer time
> on essentially home-built clusters. Obviously (?) I don't make my own
> chips or even have the patience to play with FPGA's. So getting to
> $US1 per GIPS (although in my neck of the woods we'd be talking about
> FLOPS) is a few years off, yet. But maximizing bang per buck is
> certainly in the $US10-$100 per FLOPS range when building 100s crunch
> nodes at a time. I.e. old 1 GHZ chips, old mobos, old SDRAM's, and
> well-designed s/w using SSE{1,2}. The "exact opposite" (modulo
> engineering mentality) of going high-performance/bleeding edge.
$10 per GFLOPS sounds impossible, where do you get that number from? $100 per GFLOPS is achievable today using low-end x86 or high-end embedded CPUs (e.g. MPCore has four 550 MHz ARM11 cores on a single chip, providing over 4 GFLOPS). Floating point DSPs are too expensive... So $1 per GFLOPS seems a long way off, at least 5, perhaps 10 years.

Wilco
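The "over 4 GFLOPS" figure for four 550 MHz cores is consistent with a simple peak-rate product. A sketch, assuming two floating-point operations per core per cycle; that per-cycle rate is an assumption made here to match the quoted number, not something stated in the post.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Peak rate = cores * clock * floating-point operations per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


# 4 cores at 550 MHz come from the post; 2 flops/cycle is an assumed value.
mpcore_peak = peak_gflops(cores=4, clock_ghz=0.55, flops_per_cycle=2)

# Chip price that would correspond to the $100 per GFLOPS mentioned above.
price_at_100_usd_per_gflops = 100 * mpcore_peak

print(f"peak: {mpcore_peak:.1f} GFLOPS")
print(f"price at $100/GFLOPS: ${price_at_100_usd_per_gflops:.0f}")
```

Peak numbers like this are an upper bound; sustained rates on real code are normally lower.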
<fox@ultratechnology.com> wrote in message
news:1152496187.946828.66850@75g2000cwc.googlegroups.com...
> Wilco Dijkstra wrote:
>> No it's not. If you're talking about SeaForth then you need to realise
>> it is a very simplistic instruction set and likely an even more
>> simplistic micro architecture. It will need many many more instructions
>> and cycles to do the same amount of work as existing architectures.
>
> It is the old RISC idea, except instead of making 8x as much
> code with big RISC opcodes compared to CISC opcodes we get
Eh, where did you get 8x from? The codesize bloat of a pure RISC is not nearly that bad (not even Itanium comes close). Modern RISCs like ARM easily beat CISCs on codesize.
> significantly less because of the tiny zero operand opcodes that run so
> much faster than memory without pipelines. For some things we need
> fewer instructions than other processors and for other things we need
> more, but our opcodes are so tiny that our code is often among the
> smallest. We know what code works.
Sounds unlikely. A quick glance at the instruction set shows that it doesn't support common operations such as multiplies and shifts. A shift by 16 bits would need 16 instructions (80 bits) and take 16 cycles. You could use subroutines of course, but even then you need at least 15 bits for the call and it takes even more cycles. Other trivial things like addition (eg. x = y + z) take at least 10 instructions (80 bits) compared to just one 16 or 32-bit instruction on a typical RISC. It doesn't look very good...
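The shift example can be checked with a few lines of arithmetic. The sketch below assumes 5-bit opcodes, which is what "16 instructions (80 bits)" implies, and a shift built from one-bit steps at one cycle each. The function name and the 32-bit RISC comparison value are illustrative.

```python
OPCODE_BITS = 5  # implied by "16 instructions (80 bits)" in the post


def inline_shift_cost(shift_amount, opcode_bits=OPCODE_BITS):
    """Code size and cycles for a shift built from single-bit steps."""
    return {
        "instructions": shift_amount,
        "bits": shift_amount * opcode_bits,
        "cycles": shift_amount,
    }


stack_machine = inline_shift_cost(16)
risc_bits = 32  # one fixed-width shift instruction on a typical 32-bit RISC

print(stack_machine, "vs", risc_bits, "bits for a single RISC shift")
```

This only covers the inline case. The subroutine alternative mentioned in the post trades code size against extra call cycles and is not modelled here.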
>> So even if it did reach GHz speeds (which it can't if it isn't
>> pipelined),
>
> Pipelining is a form of parallelism, but staggered. It takes a lot of
> transistors and gets little return.
It doesn't need a lot of transistors as long as you don't pipeline like the P4. A 5-stage pipeline typically provides a 4 fold speedup so it is more than worth it.
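One common textbook way to arrive at "5 stages, roughly 4x" is to divide the ideal speedup by the average cycles per instruction once stalls are included. A sketch under that simplified model; the 0.25 stall cycles per instruction is an assumed value picked to reproduce the 4x figure, not a measurement.

```python
def pipeline_speedup(stages, stall_cycles_per_instruction):
    """Speedup over an unpipelined design with the same total logic delay.

    Unpipelined: each instruction takes `stages` stage-delays.
    Pipelined:   each instruction takes 1 + stalls stage-delays on average.
    """
    return stages / (1.0 + stall_cycles_per_instruction)


ideal = pipeline_speedup(5, 0.0)
typical = pipeline_speedup(5, 0.25)   # assumed stall rate
break_even = pipeline_speedup(5, 4.0)

print(ideal, typical, break_even)
```

Under this model the pipeline only becomes slower than the unpipelined design if it averages more than `stages - 1` stall cycles per instruction. The model ignores latch overhead and any clock-rate cost from uneven stage delays.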
> Best case gives several times
> speedup and worst case is many times slowdown.
That's rubbish. Pipelining doesn't cause a slowdown. At worst you don't get as much of a speedup as you expected, but you do still get a speedup.
> Interrupts are
> notorious for instance in causing catastrophic pipeline stalls and
> cache misses. It makes chips hot and expensive and harder
> to program, and programs get bigger, and it reduces production yield.
Again, this is rubbish. Modern CPUs have interrupt latencies of tens of nanoseconds, even with caches (e.g. 50 ns on a 400 MHz Cortex-R4). Interrupts have a cost, but you can't live without them.
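For comparison with cycle-counted claims elsewhere in the thread, the quoted latency converts to clock cycles as follows. The helper name is hypothetical; the 50 ns and 400 MHz inputs are the ones given in the post.

```python
def latency_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to clock cycles at a given frequency."""
    return latency_ns * clock_mhz / 1000.0


cortex_r4_cycles = latency_cycles(latency_ns=50, clock_mhz=400)
print(cortex_r4_cycles, "cycles")
```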
> The point here is getting up to a 1GHz instruction execution at the
> minimal cost in transistors by not using pipelining or caches,
> making a class of programs smaller, faster, and simpler as
> odd as that might sound.
Odd doesn't quite cut it. My first thought was that it was a joke :-) I'm not a circuit designer, but it's obvious to me that 1GHz is wildly optimistic, even in an advanced process. You would have fewer than 20 levels of logic for instruction fetch, decode, operand fetch, execute and writeback. That's impossible.
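The logic-depth argument is a budget calculation: the cycle time divided by the number of logic levels gives the delay each level may consume. A sketch, reading the post's "20 levels at 1 GHz" as implying about 50 ps per level; that per-level figure is an inference made here, not something the post states, and latch or clocking overhead is ignored.

```python
def ps_per_logic_level(clock_ghz, logic_levels):
    """Delay budget per logic level, ignoring latch and clock-skew overhead."""
    cycle_ps = 1000.0 / clock_ghz
    return cycle_ps / logic_levels


def max_logic_levels(clock_ghz, level_delay_ps):
    """How many levels of a given delay fit in one cycle."""
    return int((1000.0 / clock_ghz) // level_delay_ps)


print(ps_per_logic_level(1.0, 20), "ps per level at 1 GHz with 20 levels")
print(max_logic_levels(1.0, 50.0), "levels of 50 ps fit in a 1 ns cycle")
```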
> The idea is you pay for transistors. You can get many millions
> to execute a billion pipelined instructions per second on one
> processor with parallel pipelines, or you can have a large number
> of non-pipelined processors running at that same speed perhaps
> thousands of them.
Or you could have a small number of simple but fast CPUs. I agree that using many simpler CPUs uses transistors more efficiently and so can be cheaper, but there is a cost: software. So it remains cheaper overall to go for fewer faster CPUs.

Wilco
In comp.arch Del Cecchi <cecchinospam@us.ibm.com> wrote:
[...various...]
> > Do you have links to websites?
> www.ibm.com
> www.intel.com
> www.amd.com
> And there is always the mips processor by i think, broadcom.
Couldna put it betta meesel', govvna. ;-)
>[... timewarp reference..]
In comp.arch Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote:
> "russell kym horsell" <kym@ukato.freeshell.org> wrote in message
> news:e8sj7g$2rt$1@chessie.cirr.com...
> > In comp.arch Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote:
> >> > My example urls, at the top of my request for information post,
> >> > reference parallel processors and may execute DOZENS of INSTRUCTION
> >> > STREAMS SIMULTANEOUSLY!!!
> >> And why is that good? Parallel computers have existed for many
> >> decades, and chips with multiple cores are not exactly new either.
> >> The problem is that apart from a few exceptions it is extremely
> >> difficult to program them and get anywhere near peak performance.
> >> This is why people look for the fastest CPUs available first and
> >> only then look at using them in parallel.
> > ...
> > Maybe this drags things back to the reputed thread subject ;-) .
> > Instead of getting the highest performance at any price, the idea
> > behind cluster design is to get the best performance per $ and scale
> > up to your budget.
> Yes, it all depends on your goals and how much money you're
> willing to spend. You get the best performance per dollar indeed by
> going for the low end of desktop CPUs rather than paying 4 times as
> much for only 20% performance. That is very different from what the
> OP proposed though.
Sure. Different market segments get around Arrow's Theorem in their own way. If you have, say, <5K to spend then you will likely find a pre-prepared package (i.e. the water-cooled overclocked bleeding edge approach ;) is going to be the best bang per buck. But if there is a more flexible ceiling the same paradigm isn't scale invariant. The scalable solution is to find a low cost basic unit with best bang-per-buck and then scale up to the price ceiling. Modifying this idea with something like a Black/Scholes model to handle the vagaries of rapid technology change -- building up a cluster over a period of years -- can bring the real-$ price down even further. (Kinda related to the problem -- do we send a probe to Alpha Cent. *now*, or decide a probe sent in 10 years will overtake it?)
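The "buy now or wait" trade-off can be illustrated with a much cruder model than the Black/Scholes approach mentioned above: assume performance per dollar doubles at a fixed rate, spend the whole budget in a single month, and count the total work done by a fixed deadline. Every number below (the 18-month doubling period, the 60-month horizon, the budget, the starting GFLOPS per dollar) is a hypothetical placeholder.

```python
def work_done(buy_month, horizon_months, doubling_months,
              budget_usd, gflops_per_usd_now):
    """Total GFLOPS-months delivered by the deadline if the budget is spent
    at `buy_month`, with performance per dollar doubling every
    `doubling_months`."""
    gflops = budget_usd * gflops_per_usd_now * 2 ** (buy_month / doubling_months)
    return gflops * (horizon_months - buy_month)


HORIZON = 60    # months until the work is due (assumed)
DOUBLING = 18   # months per doubling of performance per dollar (assumed)

best_month = max(
    range(HORIZON + 1),
    key=lambda m: work_done(m, HORIZON, DOUBLING, 10_000, 0.04),
)
print("best month to buy:", best_month)
```

In this toy model the optimum sits near `HORIZON - DOUBLING / ln 2`, so waiting can pay off when the deadline is far away. Spreading purchases over several years, as the post describes, would need a richer model than this one.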
> > For some years I've been running a small business selling computer time
> > on essentially home-built clusters. Obviously (?) I don't make my own
> > chips or even have the patience to play with FPGA's. So getting to
> > $US1 per GIPS (although in my neck of the woods we'd be talking about
> > FLOPS) is a few years off, yet. But maximizing bang per buck is
> > certainly in the $US10-$100 per FLOPS range when building 100s crunch
> > nodes at a time. I.e. old 1 GHZ chips, old mobos, old SDRAM's, and
> > well-designed s/w using SSE{1,2}. The "exact opposite" (modulo
> > engineering mentality) of going high-performance/bleeding edge.
> $10 per GFLOPS sounds impossible, where do you get that number from?
Ahhh. I said "in the range". Do the calc yourself. Just taking the chip (previously I was talking from the point of view of "something in a rack with the software, power, and some staff to look over it") you can see a 2 GHz chip (e.g. an old XP) with 2+ FP pipelines and a couple of integer pipelines and/or addr generators, running post-Gotosan matmul code, can cruise for seconds at 12 GIPS/4+ GFLOPS and cost < $US100 in 100s or 1000s. I think I'm on the money. :)
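The "do the calc yourself" step works out as below, using the lower bounds from the post (2 GHz, two FP pipelines) and its upper bound on price ($100). It assumes one floating-point operation per pipeline per cycle and counts the chip only, not the board, memory, power or staff.

```python
clock_ghz = 2.0          # from the post
fp_pipelines = 2         # "2+ FP pipelines": lower bound
chip_price_usd = 100.0   # "< $US100": upper bound

# Assumes each FP pipeline completes one operation per cycle.
gflops = clock_ghz * fp_pipelines
usd_per_gflops = chip_price_usd / gflops

print(f"{gflops:.1f} GFLOPS, ${usd_per_gflops:.2f} per GFLOPS (chip only)")
```

On these inputs the chip-only figure lands inside the $10-$100 range claimed earlier, though not at the $10 end.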
>[...]
anon wrote:

(snip)

> Pipelining DOESN'T reduce execution time of individual instructions, it
> statistically increases it!
That depends on how you measure it. The simplest pipeline separates instruction decode from instruction execution. One could then say that execution is faster without the need to wait for decode.
> Pipelining DOES increase THE RATE at which INSTRUCTION STREAMS complete
> execution, however, NO SINGLE INSTRUCTION RUNS FASTER!
Another thing that goes along with pipelining is register renaming. Register move instructions then execute in zero time, as the result goes directly to or from the appropriate register.

-- glen
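The "zero time" move can be pictured with a rename table: a move only repoints an architectural name at an existing physical register, so no data is copied. The toy model below is a sketch of that idea only. The class and method names are made up for illustration, and it never recycles physical registers, which real hardware must do.

```python
class RenameTable:
    """Maps architectural register names to physical register slots."""

    def __init__(self, arch_regs):
        self.phys = [0] * len(arch_regs)                 # physical register file
        self.map = {name: i for i, name in enumerate(arch_regs)}

    def write(self, reg, value):
        # A newly produced value is given a fresh physical register.
        self.phys.append(value)
        self.map[reg] = len(self.phys) - 1

    def move(self, dst, src):
        # Move elimination: only the mapping changes, no value is copied.
        self.map[dst] = self.map[src]

    def read(self, reg):
        return self.phys[self.map[reg]]


table = RenameTable(["r0", "r1"])
table.write("r0", 42)
table.move("r1", "r0")     # r1 now names the same physical register as r0
table.write("r0", 7)       # r0 gets a new physical register; r1 is unaffected
print(table.read("r0"), table.read("r1"))
```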
glen herrmannsfeldt wrote:
> anon wrote:
>
> (snip)
>
> > Pipelining DOESN'T reduce execution time of individual instructions, it
> > statistically increases it!
>
> That depends on how you measure it. The simplest pipeline separates
> instruction decode from instruction execution. One could then say
> that execution is faster without the need to wait for decode.
>
No single instruction runs faster, therefore, <insert IBM marketing here> ...
> > Pipelining DOES increase THE RATE at which INSTRUCTION STREAMS complete
> > execution, however, NO SINGLE INSTRUCTION RUNS FASTER!
>
> Another thing that goes along with pipelining is register renaming.
> Register move instructions then execute in zero time, as the result
> goes directly to or from the appropriate register.
>
> -- glen
AFAIK, using the two URLs provided at the top of this post for reference, a simulation of a register swap in Forth,

    REGF @ REGA @ REGF ! REGA !

requires 12 nanoseconds to execute as a program sequence where REGx are statically bound. PLUS, the architecture, as an example, is PARALLEL, and therefore may in theory perform MULTIPLE register swaps in the same twelve nanoseconds. (I've been theorizing about swapping an on-chip RAM zero page through memory-mapped I/O switches, for more FAST on-chip RAM.)

Regards,
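To see why the Forth sequence `REGF @ REGA @ REGF ! REGA !` swaps the two locations, it helps to trace the data stack. The interpreter below is a minimal sketch of a tiny Forth subset, written for this illustration: a name pushes its address, `@` fetches, `!` stores. It says nothing about timing and does not model the SEAforth hardware.

```python
def run(source, memory):
    """Interpret a tiny Forth subset over a dict used as memory.

    NAME  pushes the address NAME          ( -- addr )
    @     replaces an address by its value ( addr -- value )
    !     stores a value at an address     ( value addr -- )
    """
    stack = []
    for word in source.split():
        if word == "@":
            addr = stack.pop()
            stack.append(memory[addr])
        elif word == "!":
            addr = stack.pop()
            value = stack.pop()
            memory[addr] = value
        else:
            stack.append(word)  # the variable name stands in for its address
    return memory, stack


memory, stack = run("REGF @ REGA @ REGF ! REGA !", {"REGF": 1, "REGA": 2})
print(memory, stack)
```

Both values sit on the stack after the two fetches, so the two stores can write them back in the opposite order without a temporary variable.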
glen herrmannsfeldt wrote:
> [...]
> Another thing that goes along with pipelining is register renaming.
> Register move instructions then execute in zero time, as the result
> goes directly to or from the appropriate register.
Ya' learn something new every day around here. I'd heard of "register renaming" but never knew this was why it was used. Thanks glen!

--Randy