High-performance eval kits with Linux support

Started by mafgani November 2, 2006
Hello,

I am a newcomer to the world of DSPs and I am looking for a standalone kit
that is able to provide 8000+ MMACs, has an I2C bus/controller and can be
developed under a Linux based environment. The Blackfin STAMP boards were
perfect in every way other than the performance figure. The TI TMS320C6455
DSK offers 8000 MMACs but requires a Windows based development environment.
Also, I'm not sure if the I2C pins on the SRIO connector can be used just
by themselves.

The algorithm that I would like to implement has the following steps:

1. Find the dot-products of vectors A & Bk, where k = 1,..,150 (size of
each vector is 2048) ==> 150 dot-products of size 2048 ==> 307200 MAC ops
(?)

2. Find the max & its index, i ==> 150 comparisons (?)

3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?) 

4. Repeat 1-3 every 224 microseconds.

--> (307200 + 4096) / 224e-6 = 1390 MMACs

So I'd need at least 1500 MMACs per second. Furthermore, I will need to
have at least 4 such processes simultaneously, meaning at least 6000
MMACs. Is my analysis of the performance requirement correct?
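
As a sanity check, the arithmetic above can be written out as a short Python sketch. The names (`mmacs_required`, `N_VECTORS` and so on) are purely illustrative, and the count assumes one MAC per vector element and ignores the comparisons and the square roots.

```python
# Back-of-the-envelope MAC budget for steps 1-4 (illustrative names only).
N_VECTORS = 150      # number of Bk vectors
VECTOR_LEN = 2048    # elements per vector
PERIOD_S = 224e-6    # steps 1-3 repeat every 224 microseconds
N_PROCESSES = 4      # simultaneous instances of the algorithm


def mmacs_required(n_vectors=N_VECTORS, vector_len=VECTOR_LEN,
                   period_s=PERIOD_S, n_processes=1):
    """Return the required throughput in millions of MACs per second."""
    dot_macs = n_vectors * vector_len   # step 1: 150 dot products
    norm_macs = 2 * vector_len          # step 3: ||A||^2 and ||Bi||^2
    macs_per_period = dot_macs + norm_macs
    return n_processes * macs_per_period / period_s / 1e6


single = mmacs_required()
total = mmacs_required(n_processes=N_PROCESSES)
print(f"one process: {single:.0f} MMACs, four processes: {total:.0f} MMACs")
```

This gives roughly 1390 MMACs for one process and about 5560 for four, before any margin is added.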


Thanks in advance,
Mostafa


mafgani wrote:

> [...]
> 2. Find the max & its index, i ==> 150 comparisons (?)
>
> 3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?)
> [...]
> Is my analysis of the performance requirement correct?

Looks correct to me. Are you sure you don't have to do step 3 first, then
step 2?

What kind of data is stored in the vectors (number of bits, fixed /
floating point)?

Regards,
Andor

"Andor" <andor.bariska@gmail.com> writes:

> mafgani wrote:
>
>> [...]
>> 3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?)
>> [...]
>> Is my analysis of the performance requirement correct?
>
> Looks correct to me.

Doesn't the norm in step 3 require a square root (2-norm)? If so, that
might add some MIPS.
--
%  Randy Yates         % "Though you ride on the wheels of tomorrow,
%% Fuquay-Varina, NC   %  you still wander the fields of your
%%% 919-577-9882       %  sorrow."
%%%% <yates@ieee.org>  % '21st Century Man', *Time*, ELO
http://home.earthlink.net/~yatescr

Hello Andor,

> Looks correct to me. Are you sure you don't have to do step 3 first,
> then step 2?
Yes, you're right about that. Meaning I need at least twice as many MMACs :(.
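
To see why the order matters and where the extra MACs come from, here is a minimal pure-Python sketch that normalizes every dot product before taking the maximum. The function name `best_match` and the toy vectors are made up for illustration, and it assumes the norms of the Bk are recomputed every period rather than stored.

```python
import math


def best_match(a, bs):
    """Return (index, normalized score, MAC count) for the best-matching Bk.

    Each dot product is divided by ||A|| * ||Bk|| before the maximum is
    taken, so the norm of every Bk is needed, not only the winner's.
    """
    macs = 0
    norm_a = math.sqrt(sum(x * x for x in a))
    macs += len(a)                      # MACs for ||A||^2
    best_index, best_score = -1, -math.inf
    for k, b in enumerate(bs):
        dot = sum(x * y for x, y in zip(a, b))
        norm_b = math.sqrt(sum(y * y for y in b))
        macs += 2 * len(b)              # dot product plus ||Bk||^2
        if norm_a == 0.0 or norm_b == 0.0:
            continue                    # skip all-zero vectors
        score = dot / (norm_a * norm_b)
        if score > best_score:
            best_index, best_score = k, score
    return best_index, best_score, macs


# Toy data: B0 has the largest raw dot product (60 vs. 14),
# but B1 is the better match once both are normalized.
a = [1, 2, 3]
bs = [[10, 10, 10], [1, 2, 3]]
index, score, macs = best_match(a, bs)
print(index, macs)
```

Scaled up to a 2048-element A and 150 Bk, the same count is 2048 + 150 x 2 x 2048 = 616448 MACs per 224 microseconds, i.e. about 2752 MMACs per process, roughly double the earlier estimate. If the Bk were fixed, their norms could presumably be stored ahead of time, but nothing so far says whether they are.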
> What kind of data is stored in the vectors (number of
> bits, fixed / floating point)?
The values are 8-bit fixed point (I think).
> Regards,
> Andor

Thanks,
Mostafa

Hi Randy,

> Doesn't the norm in step 3 require a square root (2-norm)? If so, that
> might add some MIPS.
Yes, it does. That's the reason I assumed it would take at least 1500 MMACs.
>% Randy Yates
-Mostafa
mafgani wrote:
> Hello Andor,
>
> > Looks correct to me. Are you sure you don't have to do step 3 first,
> > then step 2?
>
> Yes, you're right about that. Meaning I need at least twice as many MMACs
> :(.
Looks like it.
> > What kind of data is stored in the vectors (number of
> > bits, fixed / floating point)?
>
> The values are 8-bit fixed point (I think).
That's rather crucial. An ADI TigerSHARC can supply 4000 16-bit fixed-point
MMACs (I'm not sure, but I don't think it supports faster 8-bit MACs), so you
need a card that has at least four of those on board (perhaps Bittware or
Transtech have such cards), but they are unlikely to run under Linux.
Clearspeed also has a card for PCs which supports such high performance (not
on SHARC, but with a processor of their own design).

This actually sounds like an application for an FPGA (simple code structure,
small data size, high performance). I'm sure you'll find something out there
that runs under Linux.

Regards,
Andor

Hi Andor,

> That's rather crucial. An ADI TigerSHARC can supply 4000 16bit
> fixed-point MMACs (am not sure, but I don't think it supports faster
> 8bit MACs), so you need a card that has at least four of those on board
> (perhaps Bittware or Transtech have such cards), but they are likely
> not to run under Linux. Clearspeed also has a card for PC which
> supports such high performance (not on SHARC but with a processor of
> their own design).
Thanks for pointing those out. I've already come across those myself, but the
problem is that they all seem to require a backplane to plug into (they are
all AMC/PCI/VME cards). I need something that I can use as standalone
hardware.
> This actually sounds like an application for an FPGA (simple code
> structure, small data size, high performance). I'm sure you'll find
> something out there that runs under Linux.
I had the feeling that FPGAs are rather slow. Would any of the Xilinx chips
provide the performance I'm after?

Besides, during the next stages of the project, I will have to implement a
neural network and some kind of sensor fusion algorithm too -- so, I don't
think it will remain as straightforward as it is now...
> Regards,
> Andor

Thanks,
Mostafa

Hi Andor,

Actually, I just had a look at the Xilinx Virtex 4 SX35 eval board and I
think it should fit my needs adequately. Thanks for pointing out FPGAs as
a potential solution.

Thanks,
Mostafa