On Tue, 05 Aug 2003 03:57:30 -0700, Luiz Carlos wrote:

> Has anyone seen the BDTI score for the TigerSHARC?
> 
> TigerSHARC (600MHz): 6280.
> BlackFin (600MHz):   3360.
> TMS320C64x (720MHz): 6480.
> 
> As far as I know the TigerSHARC is almost four times faster than the
> BlackFin and two times faster than the C64x at 16 bit operations (for
> the same clock frequency). So, why these scores? What am I missing?
> Was 32 bit math used for scoring the TigerSHARC?

I'm not particularly familiar with any of these processors, but for any
"real" code, of the sort that might be found in benchmarks, memory
bandwidth and latency usually dominate execution unit performance.  Are
you saying that the TigerSHARC has access to two times wider/faster
memory than the TI? (Maybe it does, I just don't know...)
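
To put numbers on the question, here is a quick Python sketch that normalises the quoted scores by clock rate. The scores and clocks are the ones quoted above; the function and variable names are made up for illustration only.

```python
# BDTI scores and clock rates as quoted in the original question.
scores = {
    "TigerSHARC": (6280, 600),   # (score, clock in MHz)
    "BlackFin": (3360, 600),
    "TMS320C64x": (6480, 720),
}

def score_per_mhz(score, clock_mhz):
    """Benchmark score normalised by clock frequency."""
    return score / clock_mhz

per_mhz = {name: score_per_mhz(s, f) for name, (s, f) in scores.items()}

for name, value in per_mhz.items():
    print(f"{name:12s} {value:6.2f} points per MHz")

# Ratios at equal clock frequency, to compare with the expected 4x and 2x.
ts_vs_bf = per_mhz["TigerSHARC"] / per_mhz["BlackFin"]
ts_vs_c64 = per_mhz["TigerSHARC"] / per_mhz["TMS320C64x"]
print(f"TigerSHARC / BlackFin:   {ts_vs_bf:.2f}")
print(f"TigerSHARC / TMS320C64x: {ts_vs_c64:.2f}")
```

On these figures the TigerSHARC comes out a bit under 1.9 times the BlackFin and about 1.16 times the C64x per MHz, well short of the expected 4x and 2x.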

-- 
Andrew

an2or@mailcircuit.com (Andor) wrote in message news:<ce45f9ed.0308160037.20914f7e@posting.google.com>...
> Kenton Williston wrote: 
> > Andor wrote: 
> > > Luiz Carlos wrote:
> > > ...
> > > > One more question. Why didn't BDTI score the TIGERSharc floating point
> > > > performance?
> > > 
> > > They scored the 21161N floating-point performance. As far as I
> > > remember, the 2116x and the TS have the same floating-point cores,
> > > just running at different clock rates. If their benchmark is to have
> > > any utility, it must be scalable with clock rate.
>  ...
> > Andor,
> > 
> > The architecture of the '2116x is significantly different than that of
> > the 'TS20x.  Hence, it is not possible to use the '2116x score to
> > project a floating-point score for the 'TS20x.
> 
> I know that the 2116x has dual independent (SIMD) 32/40bit
> floating-point units, each capable of a single cycle MAC instruction
> (that results in the 400 MFLOPS continuous score) and the
> multiply-add-subtract instruction (which results in the 600 MFLOPS peak
> score), apart from the usual single cycle
> multiply/add/subtract/min/max/average etc. instructions.
> 
> Now from the data sheet of the TigerSHARC I gather it has the same
> floating-point core as the 2116x (dual 32/40 bit, single cycle MAC,
> single cycle multiply-add-subtract for each FPU). Which would mean
> that, at least 32/40bit floating-point wise, the two cores are equal.
> Please correct me, I am no expert on the TS, I just read the data
> sheet.

There are many similarities between the '2116x and the 'TS20x, but
there are also many important differences between the two
architectures.  I cannot go into all the details here, but let me give
two simple examples:

- The '2116x uses a three-stage pipeline while the 'TS20x uses a
  ten-stage pipeline.
- The '2116x has a maximum data bandwidth of 128 bits per cycle, while
  the 'TS20x has a maximum data bandwidth of 256 bits per cycle.

Due to these and other differences, it is not possible to use the
'2116x score to project a floating-point score for the 'TS20x.

> > For your information, the TS201 is available at 500 MHz and 600 MHz.
> 
> Yeah, it says so all over the ADI webpage. But if you go and read the
> latest data sheet for this processor (Rev. PrG, 6/03) and read the
> ordering guide, there is only one unit available, the
> ADSP-TS201SABP-ENG, with a nominal clock rate of 500 MHz.
> 
> This wouldn't be the first time that announced clock rates weren't met
> with the real product.
> 
> Regards,
> Andor

According to ADI, the TS201S is currently available at 600 MHz.  I
suggest you contact ADI for more information.

Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
-----------------------------------------------------------------------
For free DSP industry news & analysis visit www.BDTI.com/dspinsider.htm
-----------------------------------------------------------------------
BDTI: Your source for independent DSP analysis & optimized DSP software
-----------------------------------------------------------------------

Kenton Williston wrote: 
> Andor wrote: 
> > Luiz Carlos wrote:
> > ...
> > > One more question. Why didn't BDTI score the TIGERSharc floating point
> > > performance?
> > 
> > They scored the 21161N floating-point performance. As far as I
> > remember, the 2116x and the TS have the same floating-point cores,
> > just running at different clock rates. If their benchmark is to have
> > any utility, it must be scalable with clock rate.
...
> Andor,
> 
> The architecture of the '2116x is significantly different than that of
> the 'TS20x.  Hence, it is not possible to use the '2116x score to
> project a floating-point score for the 'TS20x.

I know that the 2116x has dual independent (SIMD) 32/40bit
floating-point units, each capable of a single cycle MAC instruction
(that results in the 400 MFLOPS continuous score) and the
multiply-add-subtract instruction (which results in the 600 MFLOPS peak
score), apart from the usual single cycle
multiply/add/subtract/min/max/average etc. instructions.
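
For reference, the two MFLOPS figures fall out of simple arithmetic. The Python sketch below assumes a 100 MHz clock, counts a MAC as two floating-point operations and a multiply-add-subtract as three; those counting conventions and the function name are assumptions for illustration, not data sheet values.

```python
def mflops(clock_mhz, fpus, flops_per_fpu_per_cycle):
    """Floating-point throughput in MFLOPS for a given issue pattern."""
    return clock_mhz * fpus * flops_per_fpu_per_cycle

CLOCK_MHZ = 100   # assumed clock rate of the 2116x part
FPUS = 2          # dual SIMD floating-point units

# One MAC per FPU per cycle: multiply + add = 2 operations.
continuous = mflops(CLOCK_MHZ, FPUS, 2)

# One multiply-add-subtract per FPU per cycle = 3 operations.
peak = mflops(CLOCK_MHZ, FPUS, 3)

print(f"continuous: {continuous} MFLOPS")   # 400
print(f"peak:       {peak} MFLOPS")         # 600
```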

Now from the data sheet of the TigerSHARC I gather it has the same
floating-point core as the 2116x (dual 32/40 bit, single cycle MAC,
single cycle multiply-add-subtract for each FPU). Which would mean
that, at least 32/40bit floating-point wise, the two cores are equal.
Please correct me, I am no expert on the TS, I just read the data
sheet.

> For your information, the TS201 is available at 500 MHz and 600 MHz.

Yeah, it says so all over the ADI webpage. But if you go and read the
latest data sheet for this processor (Rev. PrG, 6/03) and read the
ordering guide, there is only one unit available, the
ADSP-TS201SABP-ENG, with a nominal clock rate of 500 MHz.

This wouldn't be the first time that announced clock rates weren't met
with the real product.

Regards,
Andor

an2or@mailcircuit.com (Andor) wrote in message news:<ce45f9ed.0308150025.692e447a@posting.google.com>...
> Luiz Carlos wrote:
> ...
> > One more question. Why didn't BDTI score the TIGERSharc floating point
> > performance?
> 
> They scored the 21161N floating-point performance. As far as I
> remember, the 2116x and the TS have the same floating-point cores,
> just running at different clock rates. If their benchmark is to have
> any utility, it must be scalable with clock rate.
> 
> From
> 
> http://www.bdti.com/bdtimark/chip_scores.pdf
> 
> one sees the score of the 2116x at 100 MHz is 510, so the TS-20x at
> 500 MHz (I don't see any version on any datasheet which runs at 600
> MHz) should score about 2550.
> 
> Regards,
> Andor

Andor,

The architecture of the '2116x is significantly different than that of
the 'TS20x.  Hence, it is not possible to use the '2116x score to
project a floating-point score for the 'TS20x.

For your information, the TS201 is available at 500 MHz and 600 MHz.
The TS202 and TS203 are available at 500 MHz.

Best Regards,

Kenton Williston

Luiz Carlos wrote:
...
> One more question. Why didn't BDTI score the TIGERSharc floating point
> performance?

They scored the 21161N floating-point performance. As far as I
remember, the 2116x and the TS have the same floating-point cores,
just running at different clock rates. If their benchmark is to have
any utility, it must be scalable with clock rate.

From

http://www.bdti.com/bdtimark/chip_scores.pdf

one sees the score of the 2116x at 100 MHz is 510, so the TS-20x at
500 MHz (I don't see any version on any datasheet which runs at 600
MHz) should score about 2550.
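
The projection is plain linear scaling with clock rate. A minimal Python sketch of that assumption follows; the function name is hypothetical, and the result only means something if the two cores really do behave identically per cycle.

```python
def scale_score_linearly(score, from_mhz, to_mhz):
    """Project a benchmark score to another clock rate, assuming the
    score is strictly proportional to clock frequency."""
    return score * to_mhz / from_mhz

# 2116x score of 510 at 100 MHz, projected to a 500 MHz part.
projected = scale_score_linearly(510, from_mhz=100, to_mhz=500)
print(projected)   # 2550.0
```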

Regards,
Andor

oen_br@yahoo.com.br (Luiz Carlos) wrote in message news:<8471ba54.0308140241.5daa5ec1@posting.google.com>...
> 
> Hi Kenton,
> 
> It's nice to hear from BDTI.
> 
> As I said, I don't think you are lying, but I'm surprised by the
> TIGERSharc score and am trying to understand it.
> 
> Everything you said makes sense, but can you be more specific about
> this DSP?
> BlackFin and TMS320C64x are also SIMD processors. Can you give us a
> piece of the test code to show us where the performance is lost?
> Was the Communications Logic Unit (CLU) used for the benchmarking?
> Is the test code handwritten in assembly?
> 
> One more question. Why didn't BDTI score the TIGERSharc floating point
> performance?
> 
> Luiz Carlos

Luiz,

Thanks for your interest in the scores.  I understand why it seems like
the TS20x should have a higher score.  For example, you can see that a
600 MHz TS20x can perform 4.8 billion 16-bit MACs per second, while a
720 MHz 'C64x can perform only 2.88 billion 16-bit MACs per second.
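
These peak figures are just clock rate times MACs per cycle. A rough Python sketch, assuming eight 16-bit MACs per cycle for the TS20x and four for the 'C64x (the per-cycle counts are inferred from the figures above rather than taken from a data sheet, and the function name is illustrative):

```python
def peak_macs_per_second(clock_hz, macs_per_cycle):
    """Theoretical peak MAC rate: every cycle issues a full set of MACs."""
    return clock_hz * macs_per_cycle

ts20x = peak_macs_per_second(600_000_000, 8)   # 4.8 billion MACs/s
c64x = peak_macs_per_second(720_000_000, 4)    # 2.88 billion MACs/s

print(f"TS20x: {ts20x / 1e9:.2f} GMAC/s")
print(f"C64x:  {c64x / 1e9:.2f} GMAC/s")
print(f"ratio: {ts20x / c64x:.2f}")
```

This is an upper bound only; it says nothing about how often a real benchmark can keep all the multipliers busy.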

However, the TS20x is only able to realize this level of performance on
some of our benchmarks.  On other benchmarks, the TS20x is actually
slower than the 'C64x.  The reasons for this are well beyond what I
can explain here.  If you are interested in a detailed analysis of the
TS20x performance, please contact Jeremy Giddings (our Director of
Business Development) at giddings@BDTI.com or at +1 510 665 1600.

Note that our forthcoming report "Buyer's Guide to DSP Processors,
2004 Edition" will include details of the TS20x benchmarks and a
thorough analysis of the results.  This report will be published at
the end of the year.  You are welcome to order a copy now.

Let me answer your other questions:

- The CLU is used in our benchmarks, but not every feature of the CLU
  is exercised by our benchmarks.  We can provide analysis of these
  non-benchmarked features as part of a custom analysis.

- Regarding the coding techniques, yes, our benchmarks are hand-coded
  in assembly.

- We hope to benchmark the TS20x using floating-point data in the near
  future.  I encourage you to contact ADI and let them know your
  interest in floating-point BDTI Benchmark results.


Best Regards,


Kenton Williston

williston@bdti.com (Kenton Williston) wrote in message news:<9eaf8c13.0308131416.527c4d04@posting.google.com>...
> ...


Hi Kenton,

It's nice to hear from BDTI.

As I said, I don't think you are lying, but I'm surprised by the
TIGERSharc score and am trying to understand it.

Everything you said makes sense, but can you be more specific about
this DSP?
BlackFin and TMS320C64x are also SIMD processors. Can you give us a
piece of the test code to show us where the performance is lost?
Was the Communications Logic Unit (CLU) used for the benchmarking?
Is the test code handwritten in assembly?

One more question. Why didn't BDTI score the TIGERSharc floating point
performance?

Luiz Carlos

BDTI is an independent company, and we are zealous about performing
fair, objective benchmarking.  At the same time, we work closely with
processor vendors (including Analog Devices) during the benchmarking
process to ensure that no legitimate opportunity for optimization of
the benchmark code is missed.

As some of you already pointed out, measuring a processor's
signal-processing performance requires more than comparing MHz, MIPS,
or number of MAC units.  With the BDTI Benchmarks, our approach is to
implement and thoroughly optimize a set of twelve benchmark functions
representing common DSP tasks.  BDTI ensures fair comparisons between
processors by enforcing strict rules regarding the optimizations that
are permitted, the amount of memory used, etc.

Our benchmark functions include not only algorithm kernels but also
all the required entry (setup) and exit (cleanup) code. In other
words, the benchmarks are complete modules that could be used directly
in real-world applications; they are not synthetic code fragments.
The overhead associated with the entry and exit code becomes
significant for some of the shorter benchmarks, especially for
processors with SIMD capabilities, just as it does in real
applications.

To understand the benchmark scores, note that some functions in the
BDTI Benchmark suite do not involve MACs at all--just as some
functions in real-world signal processing applications do not involve
MACs.  These functions include supervisory control code,
bit-manipulation code, and the Viterbi decoder algorithm.

Even on MAC-intensive benchmarks, the execution times are often longer
than a simple analysis of MAC throughput would suggest.  This occurs
for various reasons including architectural limitations, memory access
latencies, and overhead associated with entry and exit code.
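
To illustrate the effect of entry and exit overhead, here is a toy Python model. It is not BDTI benchmark code: the 40-cycle overhead, the kernel sizes, and the function names are all hypothetical, and memory latencies are ignored.

```python
import math

def benchmark_cycles(n_macs, macs_per_cycle, overhead_cycles):
    """Total cycles: fixed entry/exit overhead plus the MAC kernel itself."""
    return overhead_cycles + math.ceil(n_macs / macs_per_cycle)

def speedup(n_macs, wide, narrow, overhead_cycles):
    """How much faster the wider machine is at the same clock rate."""
    return (benchmark_cycles(n_macs, narrow, overhead_cycles)
            / benchmark_cycles(n_macs, wide, overhead_cycles))

OVERHEAD = 40  # hypothetical entry + exit cost in cycles, same on both machines

for n_macs in (64, 4096):
    s = speedup(n_macs, wide=8, narrow=4, overhead_cycles=OVERHEAD)
    print(f"{n_macs:5d} MACs: 8-wide is {s:.2f}x faster than 4-wide")
```

With these made-up numbers, doubling the MAC width buys about 1.17x on the short kernel and about 1.93x on the long one, which is the sense in which overhead matters most for short benchmarks on wide SIMD machines.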

Taking all this into account, BDTI believes the 'TS20x BDTIsimMark2000
score accurately represents the performance of this processor in
typical 16-bit fixed-point DSP applications.

Note that the BDTImark2000/BDTIsimMark2000 only gives a first-order
picture of a processor's performance in signal processing tasks.  We
always recommend that processor users delve into more detailed
analysis when selecting a processor.  For example, obviously users
should pay close attention to individual benchmarks that resemble the
application workload, but give less weight to those benchmarks that
don't.

Further information about the BDTImark2000/BDTIsimMark2000 scores is
available at:

http://www.bdti.com/bdtimark/BDTImark2000.htm


Best Regards,

Kenton Williston

"Ron Huizen" <rhuizen@bittware.com> 
> ...


Hi Ron,

Interesting article.

As BittWare has a long relationship with ADI, maybe you can send a
suggestion to them.
You know that when coding the ALU instructions we use some modifiers to
designate the size of the operands.
B = byte (8 bit);
S = short (16 bit);
nothing = default size (32 bit);
L = long (64 bit).

But this is not used for the multiplier operations.

R0:1 = R2*R3;; puts the 64bit result of R2 (32bit) multiplied by
R3 (32bit) into the register pair R1/R0. Ok, no problem here, but...

R0:3=R4:5*R6:7;; could have 4 different meanings:
a) LR0:3 = R4:5*R6:7;; one 64bit multiplication with a 128 bit result;
b)  R0:3 = R4:5*R6:7;; two 32bit multiplications with 64 bit results;
c) SR0:3 = R4:5*R6:7;; four 16bit multiplications with 32 bit results,
which is what is actually implemented;
d) BR0:3 = R4:5*R6:7;; eight 8bit multiplications with 16 bit results.

The TIGERSharc can only do option c), but I think it would be
clearer to the reader if the "S" modifier were required. Maybe in a
future release we will have the other options.
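
To make the implemented case concrete, here is a small Python emulation of a four-lane 16-bit multiply: two 64-bit register pairs are each treated as four signed shorts, and the four products of 32 bits each account for the 128 bits of R0:3. Lane ordering and the signed-integer format are assumptions for illustration; this models only the data layout, not the real instruction's options.

```python
def pack_shorts(lanes):
    """Pack four signed 16-bit values into one 64-bit word, lowest lane first."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & 0xFFFF) << (16 * i)
    return value

def split_shorts(value64):
    """Split a 64-bit word into four signed 16-bit lanes, lowest lane first."""
    lanes = []
    for i in range(4):
        lane = (value64 >> (16 * i)) & 0xFFFF
        if lane >= 0x8000:          # reinterpret as two's complement
            lane -= 0x10000
        lanes.append(lane)
    return lanes

def quad_short_multiply(a64, b64):
    """Four independent 16x16 multiplications, one product per lane."""
    return [x * y for x, y in zip(split_shorts(a64), split_shorts(b64))]

r4_5 = pack_shorts([1, -2, 300, -32768])    # stands in for R4:5
r6_7 = pack_shorts([7, 5, -300, -32768])    # stands in for R6:7
products = quad_short_multiply(r4_5, r6_7)  # stands in for R0:3
print(products)   # [7, -10, -90000, 1073741824]
```

The largest possible product, (-32768) * (-32768) = 2^30, still fits in a signed 32-bit lane.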

Before someone asks me why I said TIGERSharc is four times faster than
the BlackFin:
The instruction
XYMR3:0 += R0:1*R2:3;;
does eight 16bit MACs in one cycle, four in the "X" unit and four in
the "Y" unit. I think we can call it a nested SIMD architecture.

Luiz Carlos.

Andrew Reilly <andrew@gurney.reilly.home> wrote in message
news:pan.2003.08.06.01.16.02.259307@gurney.reilly.home...
> On Tue, 05 Aug 2003 03:57:30 -0700, Luiz Carlos wrote:
>
> > Has anyone seen the BDTI score for the TigerSHARC?
> >
> > TigerSHARC (600MHz): 6280.
> > BlackFin (600MHz):   3360.
> > TMS320C64x (720MHz): 6480.
> >
> > As far as I know the TigerSHARC is almost four times faster than the
> > BlackFin and two times faster than the C64x at 16 bit operations (for
> > the same clock frequency). So, why these scores? What am I missing?
> > Was 32 bit math used for scoring the TigerSHARC?
>
> I'm not particularly familiar with any of these processors, but for any
> "real" code, of the sort that might be found in benchmarks, memory
> bandwidth and latency usually dominate execution unit performance.  Are

Actually, we (BittWare) have long felt that many of the benchmarks (or
benchmarketing as we like to call them) used in the world of DSP don't take
some of these issues (like memory bandwidth) into account enough.  They
typically measure the time to run a specific routine, like an FFT or FIR,
but don't take into account the time to get the data into and out of the
core.  Since many chip vendors use a 1024pt Complex FFT as a benchmark,
we've been trying to use a measure of "Continuous FFTs" to indicate how many
FFTs a processor (or board in our case) can do per second, assuming that you
have to get new data into the DSP and the results out, for each FFT. We're
also trying to use the term "Bandwidth to Processing Ratio" (BPR) to give an
indication of how "balanced" an architecture is.
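
As a rough illustration of the "Continuous FFTs" idea, here is a Python sketch. All the numbers in it (kernel time, I/O rate) are hypothetical, the function name is made up, and the choice between overlapped and serial data movement is an assumption about the system, not a description of any particular board.

```python
def continuous_ffts_per_second(compute_s, bytes_per_fft, io_bytes_per_s,
                               overlapped=True):
    """FFT rate when each FFT also needs its data moved in and out.

    overlapped=True assumes I/O runs in the background (e.g. DMA), so the
    slower of compute and I/O sets the pace; otherwise the times add up.
    """
    io_s = bytes_per_fft / io_bytes_per_s
    period = max(compute_s, io_s) if overlapped else compute_s + io_s
    return 1.0 / period

# 1024-point complex FFT, 32-bit real + 32-bit imaginary per point,
# input moved in and results moved out.
BYTES_PER_FFT = 1024 * 8 * 2

COMPUTE_S = 10e-6        # hypothetical kernel-only FFT time
IO_BYTES_PER_S = 500e6   # hypothetical sustained I/O bandwidth

kernel_only = 1.0 / COMPUTE_S
sustained = continuous_ffts_per_second(COMPUTE_S, BYTES_PER_FFT, IO_BYTES_PER_S)
serial = continuous_ffts_per_second(COMPUTE_S, BYTES_PER_FFT, IO_BYTES_PER_S,
                                    overlapped=False)

print(f"kernel only:          {kernel_only:10.0f} FFTs/s")
print(f"continuous (overlap): {sustained:10.0f} FFTs/s")
print(f"continuous (serial):  {serial:10.0f} FFTs/s")
```

With these made-up figures the data movement, not the FFT kernel, sets the sustained rate.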

It's pretty interesting to see, with the latest generation of floating
point DSP processors like the TigerSharcs and G4 PowerPCs, how much the
memory bandwidth issues come into play as opposed to number crunching
issues.  Of course, since we sell boards based on Sharcs, and the TigerSharc
has a very nice bandwidth to processing ratio, and hence a high continuous
FFTs per second, we like to point this stuff out :-)  But seriously, in many
applications, the data movement issues can dominate the system.

Have a look at
ftp://ftp.bittware.com/documents/Articles/300MHz_TS_vs_PPC.pdf for a paper
talking about this.

As for why the Tiger scores are lower than expected, I don't have enough
details of the actual BDTI algorithms to see.  Maybe the
coding/implementation wasn't optimal, or perhaps, as someone else mentioned,
they just did 32 bit math and didn't get into making use of the 16 bit
capabilities of the Tiger.  I wonder if anyone from BDTI watches this group?

This thread has led me to ask some friends at ADI to see what's up with the
BDTI scores for Tiger.  If I get any relevant info back, I will post it.

----
Ron Huizen
BittWare