DSPRelated.com
Forums

DSP 'Beowulf'

Started by Jerry Parks September 5, 2003
"Jerry Parks" <j.parks@relmail.com> wrote in message news:au0plvs70nvtsgdt5opc6khtigruk8bh6e@4ax.com...
> On Sun, 7 Sep 2003 21:25:58 +1000, "Alex Gibson" <alxx@ihug.com.au>
> wrote:
>
> <snip>
> >mmx? sure you don't mean sse2?
> >streaming SIMD extensions. intel's improved mmx
>
> Like I said, I'm a bit out of touch on the hardware side
> of things.
>
> >For some workloads altivec (mmx equivalent for powerpc)
> >leaves sse2 for dead.
>
> Yes, I just noticed Altivec shortly after my original posting.
>
> >options for powerpc either IBM or motorola.
> >
> >beowulf wise maybe look at some opteron boxes
> >or a blade server.
> >
> >Alex
>
> What I would really like to see is a PPC SBC. Just the processor,
> some RAM, and a way to talk to it...
>
> Jerry
look on the ibm microelectronics pages or on motorola pages

http://www-3.ibm.com/chips/products/powerpc/
http://www-3.ibm.com/chips/techlib/techlib.nsf/products/4Mb_PPC_SRAM
http://www.chips.ibm.com/products/powerpc/

you won't find a g5 but plenty of others

http://www.google.com/search?sourceid=mozclient&ie=utf-8&oe=utf-8&q=ppc+sbc

and there is always the xilinx edk kit for the larger fpgas.
virtex2pro with 4 ppc + gigabit io
http://www.xilinx.com/ise/embedded/edk.htm
(virtex2pro's are a bit pricey)

Alex
Bernhard Holzmayer <holzmayer.bernhard@deadspam.com> wrote in message news:<24160273.Mg4TvRn7Ks@holzmayer.ifr.rt>...

> Another approach which comes to my mind:
> FPGAs have improved very much - in fact, you can even implement a
> couple of Pentium processors on one chip (probably too expensive).
> Depending on what "heavy math" means, and if it's more important to
> have a pretty design or to get a problem solved, it might be worth
> checking whether an approach using FPGAs would be faster or better.
How would FPGAs perform with very dynamic data processing?

I'm thinking of applications that are "dynamic" in the sense of using iterative computational methods (SVDs and eigenvalue/eigenvector decompositions) that sometimes need only a few iterations to converge and at other times need more iterations. Other "dynamic" computations could be principal component analyses where the number of principal components varies from time to time.

Rune
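Here is a concrete picture of that kind of data-dependent workload: a minimal Python sketch of plain power iteration for the dominant eigenvalue of a symmetric matrix. The function name, tolerance and iteration cap are hypothetical. It is a simpler relative of the SVD/eigendecomposition routines mentioned above, chosen only because it shows the key property. The loop runs until a convergence test passes, so the iteration count depends on the data (here, on how close the second-largest eigenvalue is to the largest) and is not fixed in advance.

```python
import math


def power_iteration(a, tol=1e-10, max_iter=10_000):
    """Estimate the dominant eigenvalue of a symmetric matrix `a` (list of lists).

    Returns (eigenvalue, eigenvector, iterations_used). The iteration count is
    data-dependent: it grows as |lambda2 / lambda1| approaches 1.
    """
    n = len(a)
    v = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    lam = 0.0
    for k in range(1, max_iter + 1):
        # w = A v, then normalize
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v, k  # A v = 0: nothing more to iterate on
        w = [x / norm for x in w]
        # Rayleigh quotient w^T A w (w has unit length)
        aw = [sum(a[i][j] * w[j] for j in range(n)) for i in range(n)]
        new_lam = sum(w[i] * aw[i] for i in range(n))
        # Stop as soon as the estimate stops changing (relative tolerance)
        if abs(new_lam - lam) < tol * abs(new_lam):
            return new_lam, w, k
        lam, v = new_lam, w
    return lam, v, max_iter
```

Running it on diag(2, 1) and on diag(2, 1.9) shows the second case needing many more iterations. For a symmetric matrix the eigenvalue error shrinks roughly by a factor (λ2/λ1)^2 per step, so the loop length is only known once the data is known.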
In comp.dsp, Jerry Parks <j.parks@relmail.com> wrote:

>On Mon, 08 Sep 2003 08:43:18 +0200, Bernhard Holzmayer
><holzmayer.bernhard@deadspam.com> wrote:
>
>>Where does the data you're going to process come from?
>>Does your concept limit how you connect the processors?
>>Because you mention sound cards and ADI eval boards, I guess you're
>>processing parallel data, aren't you?
>>
>>If input values come from parallel sources, look at the SPORT of
>>ADI, where it is possible to process 8 or even more channels
>>through a DMA, so that it "costs" nothing on the processor side.
>>This DMA concept works too if you link two or more DSPs - no
>>overhead at all!
>>However, if '40' comes from 40 signals which you're going to
>>process, you'd probably need at most 5 DSPs to receive them and not
>>40. Are 5 eval boards within your reach?
>
> Nothing too exotic - just taking the product of large primes.
> It is unfortunate that there are so many of them... :)
How much RAM would each processor need? Many lower-end DSPs only address 64k (16-bit words), and that may not even hold your big numbers. My feeling is you're looking for megabytes or hundreds of megabytes of RAM on each processor. You could connect this much RAM to any DSP through port I/O, though this would be a bottleneck. Use a DSP with an external bus connection and you could bank-select the high address bits and directly access blocks of 32k or so at a time. The speed would come closer to that of a large linear address space, but having to write the bank selects for the high bits would still complicate the programming.

You ARE using FFT multiplies for large numbers, aren't you? I've not used this method myself, but it's vastly faster than the usual method extended to arbitrarily large numbers. IIRC (visualising the operations in my mind), multiplying two N-digit numbers [or N-bit numbers, doing 16x16 or 32x32 bits at a time] the 'usual' way takes time proportional to N^2, while doing it with FFTs takes time proportional to N log(N). As the numbers grow to thousands of digits and beyond, the gap becomes orders of magnitude. I've computed five million factorial the simple 'slow' way (before I heard of using FFTs for this); it took five days on a P200.
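For anyone who hasn't seen FFT multiplication, here is a minimal Python sketch with hypothetical function names `fft` and `fft_multiply`. It treats each number as a vector of decimal digits, convolves the two digit vectors with a floating-point FFT, then rounds the result and propagates carries. Base 10 and a plain complex double-precision radix-2 FFT are assumptions chosen for clarity. Practical big-number libraries (GMP, for example) use larger digit sizes, guard against rounding error, and switch to FFT-based methods only above a size threshold.

```python
import cmath


def fft(a, invert=False):
    """Iterative radix-2 Cooley-Tukey FFT. len(a) must be a power of two."""
    n = len(a)
    a = [complex(x) for x in a]
    # Reorder the input into bit-reversed index order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages: combine transforms of size length/2 into size length.
    length = 2
    while length <= n:
        half = length // 2
        ang = (-2 if invert else 2) * cmath.pi / length
        twiddles = [cmath.exp(1j * ang * k) for k in range(half)]
        for start in range(0, n, length):
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * twiddles[k]
                a[start + k] = u + v
                a[start + k + half] = u - v
        length *= 2
    if invert:
        a = [x / n for x in a]
    return a


def fft_multiply(x, y):
    """Multiply non-negative integers by FFT convolution of their decimal digits."""
    if x == 0 or y == 0:
        return 0
    # Least-significant digit first. (Python 3.11+ caps int/str conversion at
    # 4300 digits by default; see sys.set_int_max_str_digits for larger inputs.)
    a = [int(d) for d in reversed(str(x))]
    b = [int(d) for d in reversed(str(y))]
    n = 1
    while n < len(a) + len(b):
        n *= 2  # zero-pad so the circular convolution equals the linear one
    fa = fft(a + [0] * (n - len(a)))
    fb = fft(b + [0] * (n - len(b)))
    conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
    # conv[k] approximates sum(a[i] * b[k - i]); round, then propagate carries.
    digits, carry = [], 0
    for c in conv:
        total = round(c.real) + carry
        digits.append(total % 10)
        carry = total // 10
    while carry:
        digits.append(carry % 10)
        carry //= 10
    # Rebuild the integer from its digits (most significant first).
    result = 0
    for d in reversed(digits):
        result = result * 10 + d
    return result
```

Comparing against Python's built-in big-integer multiply, e.g. `fft_multiply(3**500, 7**400) == 3**500 * 7**400`, is a cheap sanity check. With base-10 digits every convolution term stays small, so double-precision rounding is safe at these sizes. Larger digit bases or much longer numbers need a more careful error analysis.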
>>On the other hand, if I imagine a mesh of 40 SHARC DSPs with a
>>minimal hardware environment, you'd probably get impressive
>>calculation power at a moderate price. And it would certainly fit
>>very well into the Beowulf concept (sorry if I'm wrong: I'm not too
>>familiar with that :-) - at least not much programming effort...
One problem (or 'challenge') with multiprocessor implementations is the interprocessor communication, for everything: loading programs, transferring data, telling the program to run, knowing when the program has finished its calculations... It looks like Beowulf already has this done for its environment (standard PCs with LAN cards, running Linux, with the 'Beowulf' interprocessor/intertask communication software available); I see lots of info at <http://beowulf.org>.

You might be able to make DSP hardware that, in quantity (maybe 1000 nodes?), is more cost-effective than a farm of PCs of equivalent computing power, but the absolute cost would still be high and you would still need to write a lot of software for it. But maybe Linux has found, or will find, its way onto some of the larger DSPs. I don't know of any off-the-shelf DSP boards that might be cost-effective for this. Standard PC motherboards have the huge advantage of high-volume manufacturing and competition to bring the cost way down.

In short, DSPs might have a lower cost per MIPS/MFLOPS than current Pentiums, PowerPCs and their competitors, but the cost of the 'glue logic' and 'infrastructure' needed to get the final project going appears to strongly favor standard PC technology over DSPs.
>>Another approach which comes to my mind:
>>FPGAs have improved very much - in fact, you can even implement a
>>couple of Pentium processors on one chip (probably too expensive).
>>Depending on what "heavy math" means, and if it's more important to
>>have a pretty design or to get a problem solved, it might be worth
>>checking whether an approach using FPGAs would be faster or better.
>>
>>I guess FPGAs would win the race as soon as you want to do something
>>out of the ordinary, where the available processors don't fit as
>>they are. The biggest advantage of FPGAs is certainly that they can
>>be tailored to whatever need you have.
>>There are eval boards available for FPGAs, too.
>>
>>Just my 2c.
>
> Worth much more than that, I would say.
I suppose it's all a tradeoff: How much time/learning/money do you want to spend? If you want an 'optimum' answer then the first task is to check the feasibility of each of these approaches. Also, how soon do you want to start crunching numbers?
>>Bernhard
>
> Jerry
Steve Underwood wrote:
> Alex Gibson wrote:
> > mmx? sure you don't mean sse2?
> > streaming SIMD extensions. intel's improved mmx
> >
> > For some workloads altivec (mmx equivalent for powerpc)
> > leaves sse2 for dead.
> >
> > options for powerpc either IBM or motorola.
> >
> > beowulf wise maybe look at some opteron boxes
> > or a blade server.
> >
> > Alex
> When using the integer part of SSE2, I have found it generally performs
> slightly worse than MMX. For floats it works somewhat better. It's all a
> pretty lousy form of DSP add-on, though.
Hi Steve,

a while back, the BDTI floating-point benchmark was discussed here (http://www.bdti.com/bdtimark/chip_scores.pdf). It rates a 1.4 GHz Pentium III almost seven times higher than a 100 MHz SIMD SHARC DSP on DSP algorithms. Does that match your experience?

If this is true, then a single 3 GHz Pentium should easily match the processing power of 40+ old DSPs. It would also have three added advantages: less inter-processor communication (the speed bottleneck of parallel processors), easier programmability (the development bottleneck of parallel processors), and faster access to more external memory.

Regards,
Andor
Rune Allnor wrote:

> Bernhard Holzmayer <holzmayer.bernhard@deadspam.com> wrote in
> message news:<24160273.Mg4TvRn7Ks@holzmayer.ifr.rt>...
>
>> Another approach which comes to my mind:
>> FPGAs have improved very much - in fact, you can even implement a
>> couple of Pentium processors on one chip (probably too
>> expensive). Depending on what "heavy math" means, and if it's
>> more important to
>> have a pretty design or to get a problem solved, it might be
>> worth checking whether an approach using FPGAs would be faster or
>> better.
>
> How would FPGAs perform with very dynamic data processing?
>
> I'm thinking of applications that are "dynamic" in the sense of using
> iterative computational methods (SVDs and eigenvalue/eigenvector
> decompositions) that sometimes need only a few iterations to
> converge and at other times need more iterations. Other "dynamic"
> computations could be principal component analyses where the number
> of principal components varies from time to time.
>
> Rune
This never occurred to me - I have only used FPGAs for high-speed filtering (usually as a pre-stage before the DSP).

If you're able to write your dynamic algorithm in Ada (which is pretty much the same as VHDL with respect to coding algorithms), you should be able to implement it in an FPGA. It will probably have good overall performance because of the high processing speed of FPGAs.

Especially in this kind of computation, you often suffer from poor resolution in the intermediate calculation steps. If so, it might be an advantage that in an FPGA implementation you can choose the width of each signal freely. It would be possible to keep the full 64-bit result of a multiplication, widen it to 128 bits for the next multiplication and then reduce it wherever that is adequate. This would be difficult on a conventional DSP chip.

Bernhard
--
before sending to the above email address: replace deadspam.com with foerstergroup.de
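The point about choosing the word width can be illustrated in software. Below is a minimal Python sketch in which Python's arbitrary-precision integers stand in for a datapath whose width you pick yourself. The Q1.31 format and all function names are hypothetical. It compares two strategies for a chain of fixed-point multiplies. The first rounds back to 32 bits after every multiply, as a fixed-width datapath must. The second lets the product grow (32, 64, 96 ... bits) and rounds only once at the end.

```python
# Hypothetical Q1.31 fixed-point format: signed 32-bit words, 31 fractional bits.
FRAC = 31


def to_q(x, frac=FRAC):
    """Quantize a float in [-1, 1) to a fixed-point integer with `frac` fractional bits."""
    return round(x * (1 << frac))


def from_q(v, frac=FRAC):
    """Convert a fixed-point integer back to a float."""
    return v / (1 << frac)


def round_shift(v, s):
    """Drop s fractional bits, rounding to nearest (ties toward +infinity)."""
    if s <= 0:
        return v
    return (v + (1 << (s - 1))) >> s


def product_reduce_each_step(values, frac=FRAC):
    """Fixed-width datapath: round the 64-bit product back to 32 bits after every multiply."""
    acc = values[0]
    for v in values[1:]:
        acc = round_shift(acc * v, frac)
    return acc


def product_keep_full_width(values, frac=FRAC):
    """Growing datapath: keep every product bit and round only once at the end."""
    acc = values[0]
    for v in values[1:]:
        acc *= v  # exact; fractional bits accumulate with each factor
    # acc now has frac * len(values) fractional bits; keep only frac of them
    return round_shift(acc, frac * (len(values) - 1))


vals = [to_q(x) for x in (0.7, -0.33, 0.91, 0.123456, -0.5)]
print(from_q(product_reduce_each_step(vals)), from_q(product_keep_full_width(vals)))
```

The full-width version rounds the exact product of the quantized inputs only once, so it returns the nearest representable value. Its error therefore never exceeds that of the step-by-step version. In hardware the price is wider multipliers and registers, which is what an FPGA lets you build to order.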