# High-performance eval kits with Linux support

Started by November 2, 2006
```
Hello,

I am a newcomer to the world of DSPs and I am looking for a standalone kit
that is able to provide 8000+ MMACs, has an I2C bus/controller and can be
developed under a Linux-based environment. The Blackfin STAMP boards were
perfect in every way other than the performance figure. The TI TMS320C6455
DSK offers 8000 MMACs but requires a Windows-based development environment.
Also, I'm not sure if the I2C pins on the SRIO connector can be used just
by themselves.

The algorithm that I would like to implement has the following steps:

1. Find the dot-products of vectors A & Bk, where k = 1,..,150 (size of
each vector is 2048) ==> 150 dot-products of size 2048 ==> 307200 MAC ops
(?)

2. Find the max & its index, i ==> 150 comparisons (?)

3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?)

4. Repeat 1-3 every 224 microseconds.

--> (307200 + 4096) / 224e-6 = 1390 MMACs

So I'd need at least 1500 MMACs per second. Furthermore, I will need to
have at least 4 such processes simultaneously, meaning at least 6000
MMACs. Is my analysis of the performance requirement correct?

Mostafa

```
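The estimate in the post above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted in the post; the names are illustrative, and it counts MACs only (the 150 comparisons and the square roots in step 3 are ignored).

```python
# Back-of-the-envelope MAC budget using the figures quoted in the post.
NUM_VECTORS = 150        # candidate vectors B_k
VECTOR_LEN = 2048        # samples per vector
PERIOD_S = 224e-6        # steps 1-3 repeat every 224 microseconds
NUM_PROCESSES = 4        # simultaneous instances of the algorithm


def mmacs_per_second(macs_per_period, period_s=PERIOD_S):
    """Convert a per-period MAC count into millions of MACs per second."""
    return macs_per_period / period_s / 1e6


dot_product_macs = NUM_VECTORS * VECTOR_LEN      # step 1: 150 dot products
norm_macs = 2 * VECTOR_LEN                       # step 3: ||A|| and ||B_i||
per_process = mmacs_per_second(dot_product_macs + norm_macs)
total = NUM_PROCESSES * per_process

print(f"{per_process:.0f} MMAC/s per process, {total:.0f} MMAC/s for {NUM_PROCESSES}")
```

This gives roughly 1390 MMAC/s per process and about 5560 MMAC/s for four, which matches the 1500 and 6000 figures once some margin is added.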
```
mafgani wrote:

> Hello,
>
> I am a newcomer to the world of DSPs and I am looking for a standalone kit
> that is able to provide 8000+ MMACs, has an I2C bus/controller and can be
> developed under a Linux-based environment. The Blackfin STAMP boards were
> perfect in every way other than the performance figure. The TI TMS320C6455
> DSK offers 8000 MMACs but requires a Windows-based development environment.
> Also, I'm not sure if the I2C pins on the SRIO connector can be used just
> by themselves.
>
> The algorithm that I would like to implement has the following steps:
>
> 1. Find the dot-products of vectors A & Bk, where k = 1,..,150 (size of
> each vector is 2048) ==> 150 dot-products of size 2048 ==> 307200 MAC ops
> (?)
>
> 2. Find the max & its index, i ==> 150 comparisons (?)
>
> 3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?)
>
> 4. Repeat 1-3 every 224 microseconds.
>
> --> (307200 + 4096) / 224e-6 = 1390 MMACs
>
> So I'd need at least 1500 MMACs per second. Furthermore, I will need to
> have at least 4 such processes simultaneously, meaning at least 6000
> MMACs. Is my analysis of the performance requirement correct?

Looks correct to me. Are you sure you don't have to do step 3 first,
then step 2? What kind of data is stored in the vectors (number of
bits, fixed / floating point)?

Regards,
Andor

```
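Andor's question about the order of steps 2 and 3 matters because the largest raw dot product is not necessarily the best normalized match. The sketch below illustrates this with made-up two-element vectors; the function names are hypothetical, and it assumes no vector has zero norm.

```python
import math


def dot(x, y):
    """Plain dot product (one MAC per element pair)."""
    return sum(p * q for p, q in zip(x, y))


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def best_raw(a, candidates):
    """Step 2 before step 3: pick the largest unnormalized dot product."""
    return argmax([dot(a, b) for b in candidates])


def best_normalized(a, candidates):
    """Step 3 before step 2: normalize every dot product, then pick the largest."""
    norm_a = math.sqrt(dot(a, a))
    scores = [dot(a, b) / (norm_a * math.sqrt(dot(b, b))) for b in candidates]
    return argmax(scores)


# Toy data (invented for illustration): B_0 is large but poorly aligned with A,
# B_1 is small but points exactly along A.
a = [1.0, 0.0]
candidates = [[10.0, 10.0], [1.0, 0.0]]

print(best_raw(a, candidates), best_normalized(a, candidates))
```

With this data the raw maximum selects index 0 while the normalized maximum selects index 1, so the two orderings can disagree whenever the candidate vectors differ in magnitude.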
```
"Andor" <andor.bariska@gmail.com> writes:

> mafgani wrote:
>
>> Hello,
>>
>> I am a newcomer to the world of DSPs and I am looking for a standalone kit
>> that is able to provide 8000+ MMACs, has an I2C bus/controller and can be
>> developed under a Linux-based environment. The Blackfin STAMP boards were
>> perfect in every way other than the performance figure. The TI TMS320C6455
>> DSK offers 8000 MMACs but requires a Windows-based development environment.
>> Also, I'm not sure if the I2C pins on the SRIO connector can be used just
>> by themselves.
>>
>> The algorithm that I would like to implement has the following steps:
>>
>> 1. Find the dot-products of vectors A & Bk, where k = 1,..,150 (size of
>> each vector is 2048) ==> 150 dot-products of size 2048 ==> 307200 MAC ops
>> (?)
>>
>> 2. Find the max & its index, i ==> 150 comparisons (?)
>>
>> 3. Normalize the max by ||A|| & ||Bi|| ==> 2x2048 = 4096 MACs+ (?)
>>
>> 4. Repeat 1-3 every 224 microseconds.
>>
>> --> (307200 + 4096) / 224e-6 = 1390 MMACs
>>
>> So I'd need at least 1500 MMACs per second. Furthermore, I will need to
>> have at least 4 such processes simultaneously, meaning at least 6000
>> MMACs. Is my analysis of the performance requirement correct?
>
> Looks correct to me.

Doesn't the norm in step 3 require a square root (2-norm)? If so, that
--
%  Randy Yates                  % "Though you ride on the wheels of tomorrow,
%% Fuquay-Varina, NC            %  you still wander the fields of your
%%% 919-577-9882                %  sorrow."
%%%% <yates@ieee.org>           % '21st Century Man', *Time*, ELO
```
```
Hello Andor,

>Looks correct to me. Are you sure you don't have to do step 3 first,
>then step 2?

Yes, you're right about that. Meaning I need at least twice as many MMACs
:(.

> What kind of data is stored in the vectors (number of
>bits, fixed / floating point)?

The values are 8-bit fixed point (I think).

>Regards,
>Andor

Thanks,
Mostafa
```
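The "twice as many" figure can be reproduced as follows. If normalization has to happen before the maximum is taken, a norm is needed for every B_k rather than only for the winner. The sketch assumes those norms are recomputed each period; the thread does not say whether the B_k are fixed reference vectors, in which case their norms could be precomputed and the cost would drop back towards the first estimate.

```python
NUM_VECTORS = 150
VECTOR_LEN = 2048
PERIOD_S = 224e-6
NUM_PROCESSES = 4

dot_product_macs = NUM_VECTORS * VECTOR_LEN     # A . B_k for every k
norm_b_macs = NUM_VECTORS * VECTOR_LEN          # ||B_k||^2 for every k (assumed not precomputed)
norm_a_macs = VECTOR_LEN                        # ||A||^2, computed once per period

macs_per_period = dot_product_macs + norm_b_macs + norm_a_macs
per_process = macs_per_period / PERIOD_S / 1e6  # millions of MACs per second

print(f"{macs_per_period} MACs per period -> {per_process:.0f} MMAC/s per process, "
      f"{NUM_PROCESSES * per_process:.0f} MMAC/s total")
```

Under these assumptions the budget is about 2752 MMAC/s per process, or roughly 11000 MMAC/s for four processes.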
```
Hi Randy,

>Doesn't the norm in step 3 require a square root (2-norm)? If so, that

Yes, it does. That's the reason I assumed it would take at least 1500
MMACs.

>%  Randy Yates

-Mostafa
```
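One possible way to limit the cost of the square root, offered here as a sketch and not as something proposed in the thread: ||A|| is the same for every k, so it does not affect which candidate wins, and the remaining ranking can be done on squared quantities. Only the winning candidate then needs actual square roots. The function name is hypothetical, and nonzero norms are assumed.

```python
import math


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def best_match_sqrt_free(a, candidates):
    """Return (index, normalized correlation), taking square roots only for the winner."""
    best_k, best_score = -1, -math.inf
    for k, b in enumerate(candidates):
        d = dot(a, b)
        # sign(d) * d^2 / ||B_k||^2 ranks candidates exactly like d / ||B_k||,
        # because x -> sign(x) * x^2 is strictly increasing.
        score = math.copysign(d * d, d) / dot(b, b)
        if score > best_score:
            best_k, best_score = k, score
    winner = candidates[best_k]
    value = dot(a, winner) / (math.sqrt(dot(a, a)) * math.sqrt(dot(winner, winner)))
    return best_k, value


# Toy data, invented for illustration.
a = [3.0, 4.0]
candidates = [[4.0, 3.0], [-3.0, -4.0], [6.0, 8.0]]
print(best_match_sqrt_free(a, candidates))
```

Keeping the sign matters if the dot products can be negative; without it, a strongly anti-correlated vector would be ranked as a good match.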
```
mafgani wrote:
> Hello Andor,
>
> >Looks correct to me. Are you sure you don't have to do step 3 first,
> >then step 2?
>
> Yes, you're right about that. Meaning I need at least twice as many MMACs
> :(.

Looks like it.

>
> > What kind of data is stored in the vectors (number of
> >bits, fixed / floating point)?
>
>
> The values are 8-bit fixed point (I think).

That's rather crucial. An ADI TigerSHARC can supply 4000 16-bit
fixed-point MMACs (I'm not sure, but I don't think it supports faster
8-bit MACs), so you need a card that has at least four of those on board
(perhaps Bittware or Transtech have such cards), but they are likely
not to run under Linux. Clearspeed also has a card for PC which
supports such high performance (not on SHARC but with a processor of
their own design).

This actually sounds like an application for an FPGA (simple code
structure, small data size, high performance). I'm sure you'll find
something out there that runs under Linux.

Regards,
Andor

```
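The data width matters partly because it determines how wide the accumulator has to be. The following sketch computes a worst-case accumulator size for one dot product, assuming signed two's-complement samples (the thread only says "8-bit fixed point (I think)"); the function name is illustrative.

```python
def accumulator_bits(sample_bits, vector_len):
    """Bits needed for a signed accumulator holding a dot product of signed samples."""
    most_negative = -(1 << (sample_bits - 1))      # e.g. -128 for 8-bit samples
    worst_product = most_negative * most_negative  # largest possible |x * y|
    worst_sum = worst_product * vector_len         # every term at its maximum
    return worst_sum.bit_length() + 1              # +1 for the sign bit


print(accumulator_bits(8, 2048), accumulator_bits(16, 2048))
```

For 2048-element vectors this gives 27 bits with 8-bit samples, which fits a 32-bit accumulator, and 43 bits with 16-bit samples, which does not.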
```
Hi Andor,

>That's rather crucial. An ADI TigerSHARC can supply 4000 16-bit
>fixed-point MMACs (I'm not sure, but I don't think it supports faster
>8-bit MACs), so you need a card that has at least four of those on board
>(perhaps Bittware or Transtech have such cards), but they are likely
>not to run under Linux. Clearspeed also has a card for PC which
>supports such high performance (not on SHARC but with a processor of
>their own design).
>

Thanks for pointing those out. I've already come across those myself but
the problem there is that they all seem to require a backplane to connect
to (they are all AMC/PCI/VME cards). I need something that I can use as
standalone hardware.

>This actually sounds like an application for an FPGA (simple code
>structure, small data size, high performance). I'm sure you'll find
>something out there that runs under Linux.
>

I had the feeling that FPGAs are rather slow. Would any of the Xilinx
chips provide the performance I'm after? Besides, during the next stages
of the project, I will have to implement a neural network and some kind of
sensor fusion algorithm too -- so, I don't think it will remain as
straightforward as it is now...

>Regards,
>Andor

Thanks,
Mostafa
```
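On the question of whether an FPGA is fast enough: an FPGA's clock rate is usually lower than a DSP's, but throughput comes from running many MAC units in parallel. The sketch below estimates how many parallel units the 6000 MMAC/s target would need. The clock rates are hypothetical placeholders, not figures for any particular Xilinx part, and the estimate ignores memory bandwidth and control overhead.

```python
import math


def mac_units_needed(required_mmacs, clock_mhz):
    """Parallel MAC units needed if each unit completes one MAC per clock cycle."""
    return math.ceil(required_mmacs / clock_mhz)


REQUIRED_MMACS = 6000   # four processes at about 1500 MMAC/s each, from the thread

for clock_mhz in (100, 200, 400):   # hypothetical fabric clock rates
    print(clock_mhz, "MHz ->", mac_units_needed(REQUIRED_MMACS, clock_mhz), "MAC units")
```

Whether a given device offers that many multiplier blocks at the chosen clock rate has to be checked against its data sheet.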
```
Hi Andor,

Actually, I just had a look at the Xilinx Virtex 4 SX35 eval board and I
think it should fit my needs adequately. Thanks for pointing out FPGAs as
a potential solution.

Thanks,
Mostafa
```