Jeorg's question on sci.electronics.design for an under $2 DSP chip got me to thinking:

How are 1-cycle multipliers implemented in silicon? My understanding is that when you go buy a DSP chip a good part of the real estate is taken up by the multiplier, and this is a good part of the reason that DSPs cost so much. I can't see it being a big gawdaful batch of combinatorial logic that has the multiply rippling through 16 32-bit adders, so I assume there's a big table lookup involved, but that's as far as my knowledge extends.

Yet the reason that you go shell out all the $$ for a DSP chip is to get a 1-cycle MAC that you have to bury in a few (or several) tens of cycles worth of housekeeping code to set up the pointers, counters, modes &c -- so you never get to multiply numbers in one cycle, really.

How much less silicon would you use if an n-bit multiplier were implemented as an n-stage pipelined device? If I wanted to implement a 128-tap FIR filter and could live with 160 ticks instead of 140, would the chip be much smaller?

Or is the space consumed by the separate data spaces and buses needed to move all the data to and from the MAC? If you pipelined the multiplier _and_ made it a two- or three-cycle MAC (to allow time to shove data around), could you reduce the chip cost much? Would the area savings allow you to push the clock up enough to still do audio applications for less money?

Obviously any answers will be useless unless somebody wants to run out and start a chip company, but I'm still curious about it.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
MAC Architectures
Started by ●October 19, 2005
Reply by ●October 19, 2005
Tim Wescott wrote:

> Jeorg's question on sci.electronics.design for an under $2 DSP chip got
> me to thinking:
>
> How are 1-cycle multipliers implemented in silicon? My understanding is
> that when you go buy a DSP chip a good part of the real estate is taken
> up by the multiplier, and this is a good part of the reason that DSPs
> cost so much. I can't see it being a big gawdaful batch of
> combinatorial logic that has the multiply rippling through 16 32-bit
> adders, so I assume there's a big table lookup involved, but that's as
> far as my knowledge extends.

There's no lookup table. It's just a BIG cascade of ANDs. This might help:

http://www2.ele.ufes.br/~ailson/digital2/cld/chapter5/chapter05.doc5.html

> Yet the reason that you go shell out all the $$ for a DSP chip is to get
> a 1-cycle MAC that you have to bury in a few (or several) tens of cycles
> worth of housekeeping code to set up the pointers, counters, modes &c --
> so you never get to multiply numbers in one cycle, really.
>
> How much less silicon would you use if an n-bit multiplier were
> implemented as an n-stage pipelined device? If I wanted to implement a
> 128-tap FIR filter and could live with 160 ticks instead of 140 would
> the chip be much smaller?

I think this would lead to lousy performance on small loops - such as
those found in JPEG encoding.

> Or is the space consumed by the separate data spaces and buses needed to
> move all the data to and from the MAC? If you pipelined the multiplier
> _and_ made it a two- or three-cycle MAC (to allow time to shove data
> around) could you reduce the chip cost much? Would the amount of area
> savings you get allow you to push the clock up enough to still do audio
> applications for less money?

Quite a lot of the chip cost depends on the design complexity and the
amount of time and money spent in R&D, not to mention the quantity of
chips the company hopes to sell, so it's not a directly proportional
relation between cost and size of chip.

If you're trying to save money, you could try using a fast general
purpose microcontroller instead of a DSP.

> Obviously any answers will be useless unless somebody wants to run out
> and start a chip company, but I'm still curious about it.
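[Editorial note: the "BIG cascade of ANDs" feeding an adder array can be sketched behaviorally in a few lines of Python. This is a sketch of an unsigned shift-and-add/array multiplier for illustration, not the gate-level layout of any particular chip.]

```python
def array_multiply(a, b, n=16):
    """Behavioral model of an n x n array multiplier: each partial-product
    row is the AND of one multiplier bit with the whole multiplicand, and
    the rows are summed by the adder array."""
    product = 0
    for i in range(n):
        bit = (b >> i) & 1         # one multiplier bit
        row = (a * bit) << i       # AND row, shifted into column position
        product += row             # the adder cascade
    return product
```

In hardware all n rows exist at once and the additions happen in combinational logic; the loop here just serializes the same arithmetic.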
Reply by ●October 19, 2005
Your question got me thinking, trying to recall the discussions I had in the microprocessor architecture classes. So here is some food for thought:

I seem to recall (back then - '99 to '01) that multipliers were assumed to take multiple cycles; I think for class purposes we usually assumed three or four cycles. Sometimes the premise was that there were multiple multipliers and other ALU units that could be used simultaneously. If an instruction was set to execute and there weren't resources available, this resulted in a pipeline stall, but otherwise the apparent output was single cycle. I even believe we had test problems dealing with determining how many multipliers a processor required versus other resource items (each with a $ value attached), given a certain mix of instructions, and having to determine the optimal resource mix.

In the latter portion of the class, we got away from CPU architecture and spent a lot of time dealing with the concept of maintaining single-cycle execution through the use of compiler scheduling. A lot of emphasis was placed on scheduling algorithms that scan for data and resource dependencies, and on how code gets executed out of sequence to maximize resource utilization.

Another concept that was raised is the idea of sub-cycle clocking, or micro-operations, where in a single "instruction cycle" multiple processor cycles occur while still maintaining the apparent single-cycle execution.

I would imagine that modern DSPs rely on techniques like these, or some totally new ones, to maximize throughput.
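[Editorial note: the stall-versus-overlap behavior described above is easy to put numbers on with a toy cycle-count model. The 3-cycle depth is a hypothetical figure in the spirit of the class examples, not the spec of any real part.]

```python
def pipelined_cycles(num_ops, depth=3, dependent=False):
    """Cycles to finish num_ops multiplies on a depth-stage pipelined
    multiplier. Independent ops overlap (one result per cycle once the
    pipeline is full); fully dependent ops each stall for the full
    latency of the previous result."""
    if dependent:
        return num_ops * depth       # each op waits on the prior result
    return depth + (num_ops - 1)     # fill once, then one per cycle
```

For example, 128 independent MACs on a 3-deep multiplier cost 130 cycles, close to the "160 ticks instead of 140" trade-off in the original post, while a fully dependent chain balloons to 384.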
Reply by ●October 19, 2005
Pramod Subramanyan wrote:

> Tim Wescott wrote:
>
>> (snip)
>
> There's no lookup table. It's just a BIG cascade of ANDs. This might
> help:
>
> http://www2.ele.ufes.br/~ailson/digital2/cld/chapter5/chapter05.doc5.html

Interesting. So that's what they actually do in practice, just copy a
page out of a textbook? Wouldn't the stages of adders really cause a
speed hit? To have your signal ripple through so many stages would
require you to slow your clock way down from what it could be otherwise
-- it seems an odd way to build a chip whose purpose in life is to be
really fast while doing a MAC.

>> How much less silicon would you use if an n-bit multiplier were
>> implemented as an n-stage pipelined device? If I wanted to implement a
>> 128-tap FIR filter and could live with 160 ticks instead of 140 would
>> the chip be much smaller?
>
> I think this would lead to lousy performance on small loops - such as
> those found in JPEG encoding.

Good point. Yes it would, unless you used some fancy pipelining to keep
the throughput up (which would probably require a fancy optimizer to let
humans write fast code).

> Quite a lot of the chip cost depends on the design complexity and the
> amount of time and money spent in R&D, not to mention the quantity of
> chips the company hopes to sell, so it's not a directly proportional
> relation between cost and size of chip. If you're trying to save money,
> you could try using a fast general purpose microcontroller instead of
> a DSP.

Yet DSP chips cost tons of money, which disappoints Jeorg, who designs
for high-volume customers who are _very_ price sensitive. The question
was more a hypothetical "what would Atmel do if Atmel wanted to compete
with the dsPIC" than "should I have a custom chip designed for my
10-a-year production cycle".

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
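[Editorial note: one standard answer to the ripple-delay worry, given here as general background rather than as what any particular DSP does: fast multipliers sum their partial products with carry-save adders, the building block of the Wallace trees mentioned elsewhere in the thread, so carries propagate only once, in the final add. A minimal model:]

```python
def carry_save_add(x, y, z):
    """Carry-save step: compress three addends into a sum word and a
    carry word with no carry propagation. Each output bit depends only
    on the three input bits in its own column."""
    s = x ^ y ^ z                              # per-column sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-column carry (majority) bits
    return s, c
```

Since s + c equals x + y + z, a tree of these reduces many partial products down to two numbers before a single carry-propagate add, keeping the combinational depth logarithmic rather than linear in the number of rows.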
Reply by ●October 19, 2005
On Wed, 19 Oct 2005 10:42:41 -0700, Tim Wescott wrote:

> Yet DSP chips cost tons of money, which disappoints Jeorg who designs
> for high-volume customers who are _very_ price sensitive.

Actually, I believe that prices have little to do with cost, particularly in high-volume, low-material-cost items like ICs. This is true until the item has become a commodity, where anybody can make it. At that point, market factors start to bring the prices down. Until that point, pricing is more closely related to the cost of what the item replaces. In the case of DSP chips, the replacement is a traditional microprocessor, with its fast external memory, PC design and debug time, etc.

So, cutting down on the silicon area won't help prices; it'll just increase the profits of the chipmakers. What helps prices is stiff, fair competition, and lots of it. So, chip makers try to differentiate their designs, making it hard to 'jump ship' and head off in new directions, thus keeping a particular group of users a 'captive audience'. Once standardization sets in, they are doomed to compete.

---
Regards,
Bob Monsen

Let us grant that the pursuit of mathematics is a divine madness of the human spirit.
- Alfred North Whitehead
Reply by ●October 19, 2005
Tim Wescott wrote:

> How are 1-cycle multipliers implemented in silicon? My understanding is
> that when you go buy a DSP chip a good part of the real estate is taken
> up by the multiplier, and this is a good part of the reason that DSPs
> cost so much. (snip)

Single-cycle multipliers in small microcontrollers are frequently 8x8, which is obviously much easier. The chip mentioned, the msp430, does 16x16, but it is not actually single-cycle (as far as I remember). The other big difference compared to expensive DSPs is the speed: it is a lot easier to do 16x16 in a single cycle at 8 MHz (the top speed of the current msp430s) than at a few hundred MHz (for expensive DSPs).
Reply by ●October 19, 2005
Tim Wescott wrote:

> Pramod Subramanyan wrote:
>
>> (snip)
>>
>> There's no lookup table. It's just a BIG cascade of ANDs. This might
>> help:
>>
>> http://www2.ele.ufes.br/~ailson/digital2/cld/chapter5/chapter05.doc5.html
>
> Interesting. So that's what they actually do in practice, just copy a
> page out of a textbook? Wouldn't the stages of adders really cause a
> speed hit? To have your signal ripple through so many stages would
> require you to slow your clock way down from what it could be otherwise
> -- it seems an odd way to build a chip whose purpose in life is to be
> really fast while doing a MAC.

It's much, much harder than just copying a page out of a textbook. There are small optimizations that depend strongly on data distributions, etc. Even before the designer can begin laying out the multiplier, which is pretty much the hardest part, they have to work out whether it has the characteristics required.

As an example, I recently designed a 4-bit x 4-bit multiplier as a class project. It's much harder than many people realise, and its complexity grows exponentially (in most cases) with the input bit width.

Sometimes it may be as simple as laying down a standard multiplier block (from one of many IP libraries around); however, in most DSPs this will be the critical timing path for single-cycle operation, and so it must be hand modified to produce acceptable path delays, then assessed under all conditions.

Certainly not a lookup table: that would indeed be simply copying from a book, and would also require (2^(2*N))*N/4 bytes of storage. For anything but small N this would be enormous, and not very efficient in terms of chip real estate.

As an aside, the other members of my class implemented their multipliers in a pipeline configuration, whilst I did mine in a completely parallel configuration (with a ripple adder, as high speed wasn't a design consideration). This means the others had 2/3/4-cycle latencies whilst mine was a single cycle. The trade-off is that the upper frequency of mine was more limited than theirs due to the increased path delays.

Getting single-cycle high-speed multipliers is a very challenging prospect, and one on which much research is still ongoing. You should have a go at making up a simple 3-bit x 3-bit multiplier using single transistors on a PCB sometime... it's quite similar to the layout flow used in IC design.
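[Editorial note: the storage formula quoted above is easy to sanity-check numerically. This quick calculation assumes unsigned n x n inputs and full 2n-bit products.]

```python
def mult_lut_bytes(n):
    """Bytes needed for a full n x n multiply lookup table:
    2^(2n) entries, each holding a 2n-bit product (2n/8 = n/4 bytes)."""
    return (1 << (2 * n)) * (2 * n) // 8
```

An 8x8 table is a plausible 128 KiB, but a 16x16 table already needs 16 GiB, which bears out the "enormous for anything but small N" point.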
Reply by ●October 19, 2005
Newer FPGAs have lots of fast 18 x 18 multipliers. The humble XC4VSX25 has, among other goodies, 128 such multipliers running at a max 500 MHz single-cycle rate. The mid-range SX35 has 192, and the top SX55 has 512 such fast 18 x 18 multipliers, each with its associated 48-bit accumulator structure. We invite you to keep that kind of arithmetic performance busy...

No wonder these FPGAs can outperform sophisticated and expensive DSP chips.

Peter Alfke, Xilinx
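[Editorial note: the multiply-accumulate those multiplier/accumulator blocks perform can be modeled behaviorally as below. The `acc_bits=48` default matches the accumulator width mentioned above, but this sketch handles only unsigned wraparound and ignores the signed and saturation behavior of real FPGA blocks.]

```python
def mac_fir(coeffs, samples, acc_bits=48):
    """One FIR output computed as a chain of multiply-accumulates,
    wrapping the running sum to acc_bits the way a fixed-width
    hardware accumulator would."""
    acc = 0
    mask = (1 << acc_bits) - 1
    for c, x in zip(coeffs, samples):
        acc = (acc + c * x) & mask    # one MAC per tap
    return acc
```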
Reply by ●October 19, 2005
Tim Wescott skrev:

> Pramod Subramanyan wrote:
>
> (snip)
>
>> http://www2.ele.ufes.br/~ailson/digital2/cld/chapter5/chapter05.doc5.html
>
> Interesting. So that's what they actually do in practice, just copy a
> page out of a textbook? Wouldn't the stages of adders really cause a
> speed hit? To have your signal ripple through so many stages would
> require you to slow your clock way down from what it could be otherwise

AFAIR the delay for the straightforward N*N-bit parallel multiplier is only around double the delay of an N-bit adder, i.e. the longest path in the multiplier is lsb to msb plus top to bottom.

> -- it seems an odd way to build a chip whose purpose in life is to be
> really fast while doing a MAC.

I think it's more likely that they look at different options and find the smallest that is fast enough ;)

Have a look at http://www.andraka.com/multipli.htm

(snip)

> Yet DSP chips cost tons of money, which disappoints Jeorg who designs
> for high-volume customers who are _very_ price sensitive. The question
> was more a hypothetical "what would Atmel do if Atmel wanted to compete
> with the dsPIC" than "should I have a custom chip designed for my
> 10-a-year production cycle".

I'm not sure the size of the multiplier makes a big difference; my guess is that if you look at the die you would see that most of it is memory.

What price are you looking for? How much memory? How fast? Not that I will build you one, but I'm curious :)

-Lasse
Reply by ●October 19, 2005
On Wed, 19 Oct 2005 09:25:12 -0700, Tim Wescott <tim@seemywebsite.com> wrote:

> (snip)
>
> Obviously any answers will be useless unless somebody wants to run out
> and start a chip company, but I'm still curious about it.

A while back when I was doing such things, Wallace trees and Booth multipliers were all the rage. Doing a search on those turned up Ray Andraka's page (no big surprise :)) which has a really good discussion of the alternatives.

Since then things have gotten even smaller and faster and, as someone else pointed out, the FPGA companies now find it prudent to splatter large numbers of very fast single-cycle multipliers around their parts just because they can (and because they know people will use them). I've no clue what they're doing there, but efficient single-cycle multipliers have been around for a long time in various flavors. I'm sure they're not all the same.

Eric Jacobsen
Minister of Algorithms, Intel Corp.
My opinions may not be Intel's opinions.
http://www.ericjacobsen.org
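[Editorial note: Booth recoding, mentioned above, is worth a sketch. Radix-4 recoding rewrites the multiplier as digits in {-2, -1, 0, 1, 2}, roughly halving the number of partial-product rows the adder tree must sum. This toy version assumes unsigned operands that fit in n-1 bits and skips the sign-extension bookkeeping a real implementation needs.]

```python
# Standard radix-4 Booth table: 3-bit window value -> digit in {-2..2}
BOOTH_DIGIT = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}

def booth_radix4_digits(b, n=16):
    """Scan overlapping 3-bit windows of b (one per bit pair) and emit
    the recoded digits, least significant first."""
    digits, prev = [], 0
    for i in range(0, n, 2):
        window = (((b >> i) & 0b11) << 1) | prev
        digits.append(BOOTH_DIGIT[window])
        prev = (b >> (i + 1)) & 1
    return digits

def booth_multiply(a, b, n=16):
    """Sum the recoded partial products: digit * a, weighted by 4^i."""
    return sum(d * a << (2 * i)
               for i, d in enumerate(booth_radix4_digits(b, n)))
```

Eight digit rows replace sixteen AND rows for a 16-bit multiplier, which shortens the adder tree at the cost of the recoding logic and negative-row handling.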