DSPRelated.com

TigerSHARC BDTI score

Started by Luiz Carlos August 5, 2003
jaime.aranguren@ieee.org (Jaime Andres Aranguren Cardona) wrote
> Good to remember that. But, any further opinion about the scores? I am
> also surprised. I would have expected something better.
Good, I'm not the only one!
> Luiz: have you asked BDTI people?
>
> JaaC
No Jaime, I have not. See my post below (Why I asked here)!

Luiz Carlos
jaime.aranguren@ieee.org (Jaime Andres Aranguren Cardona) wrote 
> Good to remember that. But, any further opinion about the scores? I am
> also surprised. I would have expected something better.
>
> Luiz: have you asked BDTI people?
>
> JaaC
I forgot something.

I don't think the scores are biased. ADI, TI and others are said to be partners in the "BDTI Benchmark Partner Program".

Let's see:

ADSP-219x (1 MAC, 160MHz), score = 420.
BlackFin (2 MACs, 600MHz), score = 3360.
420/160 = 2.625 points/MHz. BlackFin has two MACs, so: 2.625*600*2 = 3150 (near 3360, OK!)

C54x (1 MAC, 160MHz), score = 500.
C55x (2 MACs, 300MHz), score = 1460.
500/160*2 (2 MACs)*300 = 1875 (near 1460, OK!)

The same thing holds true for ADSP-218x versus ADSP-219x, and SHARC SISD versus SHARC SIMD. ADSP-218x and ADSP-210xx are absent from the current table.

I'm not sure about the C54x being faster than the ADSP-219x, but it's not too hard to believe that.

So, I think we can give some credit to the BDTI scores. But the TigerSHARC...

Luiz Carlos
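The back-of-the-envelope scaling in the post above can be written out as a short Python sketch. It only reproduces the arithmetic quoted in the post (scores, clocks and MAC counts are the ones given there); the function name `projected_score` is made up for illustration, and the linear scaling with clock and MAC count is the poster's simplification, not a BDTI method.

```python
# Naive projection used in the post: take the older part's points per MHz
# per MAC unit and scale it to the newer part's clock and MAC count.
def projected_score(base_score, base_mhz, base_macs, new_mhz, new_macs):
    points_per_mhz_per_mac = base_score / (base_mhz * base_macs)
    return points_per_mhz_per_mac * new_mhz * new_macs

# ADSP-219x (1 MAC, 160 MHz, score 420) projected to BlackFin (2 MACs, 600 MHz)
blackfin = projected_score(420, 160, 1, 600, 2)
# C54x (1 MAC, 160 MHz, score 500) projected to C55x (2 MACs, 300 MHz)
c55x = projected_score(500, 160, 1, 300, 2)

print(f"BlackFin: projected {blackfin:.0f}, published 3360")
print(f"C55x:     projected {c55x:.0f}, published 1460")
```

The projection lands within about 7% of the published BlackFin score but overshoots the C55x by roughly 28%, so "near" is a loose match in the second case.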
Andrew Reilly <andrew@gurney.reilly.home> wrote 
> I'm not particularly familiar with any of these processors, but for any
> "real" code, of the sort that might be found in benchmarks, memory
> bandwidth and latency usually dominate execution unit performance. Are
> you saying that the TigerSHARC has access to two times wider/faster
> memory than the TI? (Maybe it does, I just don't know...)
The two C64x data buses are 64 bits wide; the TIGERSharc data buses (also two) are 128 bits wide.
The TIGERSharc uses embedded DRAM (from IBM) and 128kbit caches. This may slow it down, but not enough to cause the big difference that was shown.

Luiz Carlos
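To put rough numbers on the bus widths mentioned above, here is a small Python sketch. It assumes, purely for illustration, that each data bus can move one full-width word per core clock cycle and uses the clock rates quoted elsewhere in this thread (600 MHz and 720 MHz); real sustained bandwidth depends on memory latency, cache behaviour and access patterns, none of which is modelled here.

```python
# Peak on-chip data bus bandwidth under the ASSUMPTION of one full-width
# transfer per bus per core clock cycle (an idealized upper bound).
def peak_bus_bytes_per_s(num_buses, bus_bits, clock_hz):
    return num_buses * (bus_bits // 8) * clock_hz

c64x = peak_bus_bytes_per_s(2, 64, 720_000_000)         # two 64-bit buses
tigersharc = peak_bus_bytes_per_s(2, 128, 600_000_000)  # two 128-bit buses

print(f"C64x:       {c64x / 1e9:.2f} GB/s peak")
print(f"TigerSHARC: {tigersharc / 1e9:.2f} GB/s peak")
print(f"ratio:      {tigersharc / c64x:.2f}x")
```

Under that assumption the wider buses more than offset the lower clock, which is consistent with the point that bus width alone does not explain the scores.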
Bernhard Holzmayer <holzmayer.bernhard@deadspam.com> wrote 
> Hi Luiz,
> I was comparing TIs processors with Sharc some time ago, and I found
> that the performance is depending on what I'm going to do with it.
> Check the relevance of
> - floating point processing
> - 32bit/40bit width
> - command pipelining
> - DMA capabilities
> - built-in interfaces/protocols
> Depending on what you really do you'll find the one or the other
> better.
> If the one or the other is faster, sometimes depends on that.
> If the processor has to "manually" implement what the DMA processor
> of the other does in the background, this may cost you a lot of
> performance.
> If the command pipeline holds a complete loop and saves you memory
> access this might increase speed - but what if pipelining fails in
> your case?
>
> Without knowing your application, it's difficult to give an advice.
> Nevertheless, I'd say: TigerSharc is a 32bit-DSP.
> If you really need a 16bit-DSP there might be better choices out
> there.
>
> I talked to a couple of developers who started with TI processors
> (62xx/67xx) and ended up with their "better" choice Sharc.
> Others seem to be happy with TI and the CodeComposer Studio.
> By the way: don't forget that the programming tools are also very
> important - you'll have to deal with them all the time.
>
> Bernhard
Hi Bernhard,

I'm aware of and agree with everything you said.
I'm not choosing the TigerSHARC (possibly; nothing is defined yet) because of its 16-bit integer performance. My product is not too cost constrained, so I can trade some money for ease of use. You have used a SHARC processor, so you know how natural and pleasant coding for it is. Since almost all the code I write is in assembly, this is very important. Also, its 32-bit and floating-point capabilities can speed up some early implementations a lot. Time to market.

Returning to my original question: I just didn't understand that BDTI score for the TIGERSharc. And I really don't like it when I don't understand something.

Luiz Carlos
We have to remember that TigerSHARC is primarily designed for
floating-point use though it can perform good fixed-point operations
too. I guess the benchmark compares only fixed-point operations and
the other two DSPs are only fixed-point DSPs.

oen_br@yahoo.com.br (Luiz Carlos) wrote in message news:<8471ba54.0308050257.5407f62a@posting.google.com>...
> Has anyone seen the BDTI score for the TigerSHARC?
>
> TigerSHARC (600MHz): 6280.
> BlackFin (600MHz): 3360.
> TMS320C64x (720MHz): 6480.
>
> As far as I know the TigerSHARC is almost four times faster than the
> BlackFin and two times faster than the C64x at 16 bit operations (for
> the same clock frequency). So, why these scores? What am I missing?
> Was 32 bit math used for scoring the TigerSHARC?
>
> Luiz Carlos
Andrew Reilly <andrew@gurney.reilly.home> wrote in message
news:pan.2003.08.06.01.16.02.259307@gurney.reilly.home...
> On Tue, 05 Aug 2003 03:57:30 -0700, Luiz Carlos wrote:
>
> > Has anyone seen the BDTI score for the TigerSHARC?
> >
> > TigerSHARC (600MHz): 6280.
> > BlackFin (600MHz): 3360.
> > TMS320C64x (720MHz): 6480.
> >
> > As far as I know the TigerSHARC is almost four times faster than the
> > BlackFin and two times faster than the C64x at 16 bit operations (for
> > the same clock frequency). So, why these scores? What am I missing?
> > Was 32 bit math used for scoring the TigerSHARC?
>
> I'm not particularly familiar with any of these processors, but for any
> "real" code, of the sort that might be found in benchmarks, memory
> bandwidth and latency usually dominate execution unit performance. Are
Actually, we (BittWare) have long felt that many of the benchmarks (or benchmarketing as we like to call them) used in the world of DSP don't take some of these issues (like memory bandwidth) into account enough. They typically measure the time to run a specific routine, like an FFT or FIR, but don't take into account the time to get the data into and out of the core. Since many chip vendors use a 1024pt Complex FFT as a benchmark, we've been trying to use a measure of "Continuous FFTs" to indicate how many FFTs a processor (or board in our case) can do per second, assuming that you have to get new data into the DSP and the results out, for each FFT. We're also trying to use the term "Bandwidth to Processing Ratio" (BPR) to give an indication of how "balanced" an architecture is.

It's pretty interesting to see, with the latest generation of floating point DSP processors, like the TigerSharcs and G4 PowerPCs, how much the memory bandwidth issues come into account as opposed to number crunching issues. Of course, since we sell boards based on Sharcs, and the TigerSharc has a very nice bandwidth to processing ratio, and hence a high continuous FFTs per second, we like to point this stuff out :-) But seriously, in many applications, the data movement issues can dominate the system.

Have a look at ftp://ftp.bittware.com/documents/Articles/300MHz_TS_vs_PPC.pdf for a paper talking about this.

As for why the Tiger scores are lower than expected, I don't have enough details of the actual BDTI algorithms to see. Maybe the coding/implementation wasn't optimal, or perhaps, as someone else mentioned, they just did 32 bit math and didn't get into making use of the 16 bit capabilities of the Tiger. I wonder if anyone from BDTI watches this group?

This thread has led me to ask some friends at ADI to see what's up with the BDTI scores for Tiger. If I get any relevant info back, I will post it.

----
Ron Huizen
BittWare
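The "Continuous FFTs" idea can be sketched in Python as follows. This is only one possible reading of the post, not BittWare's actual metric: it assumes a 1024-point complex FFT on 32-bit samples, a single I/O path carrying both input and results, and (when `overlapped=True`) DMA transfers that fully overlap with computation. The function names and the sample numbers (20 microseconds per FFT, 400 MB/s of I/O) are hypothetical.

```python
def fft_io_bytes(points=1024, bytes_per_component=4):
    # Complex input in plus complex result out: 2 buffers x N points
    # x 2 components (real, imaginary) x bytes per component.
    return 2 * points * 2 * bytes_per_component

def continuous_ffts_per_second(compute_s, io_bytes_per_s, overlapped=True):
    """FFTs per second when every FFT needs fresh input and output transfer.

    overlapped=True assumes I/O runs in the background (e.g. via DMA), so
    the slower of the two stages sets the pace; otherwise they add up.
    """
    io_s = fft_io_bytes() / io_bytes_per_s
    period = max(compute_s, io_s) if overlapped else compute_s + io_s
    return 1.0 / period

# Hypothetical part: 20 us per FFT in the core, 400 MB/s of external I/O.
kernel_only = 1.0 / 20e-6
sustained = continuous_ffts_per_second(20e-6, 400e6)
print(f"kernel-only rate: {kernel_only:,.0f} FFTs/s")
print(f"continuous rate:  {sustained:,.0f} FFTs/s")
```

With these made-up numbers the transfer time exceeds the compute time, so the continuous rate is set entirely by I/O, which is the kind of imbalance the post is describing.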
"Ron Huizen" <rhuizen@bittware.com> 
Hi Ron,

Interesting article.

As BittWare has a long relationship with ADI, maybe you can send a suggestion to them. You know that when coding the ALU instructions we use some modifiers to designate the size of the operands:
B = byte (8 bit);
S = short (16 bit);
nothing = default size (32 bit);
L = long (64 bit).
But this is not used for the multiplier operations.

R0:1 = R2*R3;; puts the 64-bit result of R2 (32 bit) multiplied by R3 (32 bit) in the register pair R1/R0. OK, no problem here, but...

R0:3 = R4:5*R6:7;; could have four different meanings:
a) LR0:3 = R4:5*R6:7;; one 64-bit multiplication with a 128-bit result;
b) R0:3 = R4:5*R6:7;; two 32-bit multiplications with 64-bit results;
c) SR0:3 = R4:5*R6:7;; four 16-bit multiplications with 32-bit results, which is what is actually implemented;
d) BR0:3 = R4:5*R6:7;; eight 8-bit multiplications with 16-bit results.

The TIGERSharc can only do option c), but I think it would be clearer to the reader if the "S" modifier were required. Maybe in a future release we will have the other options.

Before someone asks me why I said the TIGERSharc is four times faster than the BlackFin:
The instruction XYMR3:0 += R0:1*R2:3;; does eight 16-bit MACs in one cycle, four in the "X" unit and four in the "Y" unit. I think we can call it a nested SIMD architecture.

Luiz Carlos.
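For readers unfamiliar with packed arithmetic, the four-way 16-bit multiply described in option c) can be emulated in Python as below. This is an illustrative model only, not a description of the actual TigerSHARC instruction: the lane ordering (lane 0 in the low 16 bits) and the signed interpretation of the operands are assumptions, and the helper names are made up.

```python
def pack_shorts(lanes):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & 0xFFFF) << (16 * i)
    return value

def split_shorts(value64):
    """Unpack a 64-bit word into four signed 16-bit lanes."""
    lanes = []
    for i in range(4):
        x = (value64 >> (16 * i)) & 0xFFFF
        lanes.append(x - 0x10000 if x & 0x8000 else x)
    return lanes

def quad_short_multiply(a64, b64):
    """Four independent 16x16 multiplies; each product fits in 32 bits."""
    return [x * y for x, y in zip(split_shorts(a64), split_shorts(b64))]

def quad_short_mac(acc, a64, b64):
    """Multiply-accumulate: add the four products to four accumulators."""
    return [s + p for s, p in zip(acc, quad_short_multiply(a64, b64))]

a = pack_shorts([1, -2, 3, 4])
b = pack_shorts([5, 6, -7, 8])
print(quad_short_multiply(a, b))             # [5, -12, -21, 32]
print(quad_short_mac([10, 10, 10, 10], a, b))  # [15, -2, -11, 42]
```

One such operation in each of the two compute units gives the eight 16-bit MACs per cycle mentioned in the post.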
BDTI is an independent company, and we are zealous about performing
fair, objective benchmarking.  At the same time, we work closely with
processor vendors (including Analog Devices) during the benchmarking
process to ensure that no legitimate opportunity for optimization of
the benchmark code is missed.

As some of you already pointed out, measuring a processor's
signal-processing performance requires more than comparing MHz, MIPS,
or number of MAC units.  With the BDTI Benchmarks, our approach is to
implement and thoroughly optimize a set of twelve benchmark functions
representing common DSP tasks.  BDTI ensures fair comparisons between
processors by enforcing strict rules regarding the optimizations that
are permitted, the amount of memory used, etc.

Our benchmark functions include not only algorithm kernels but also
all the required entry (setup) and exit (cleanup) code. In other
words, the benchmarks are complete modules that could be used directly
in real-world applications; they are not synthetic code fragments.
The overhead associated with the entry and exit code becomes
significant for some of the shorter benchmarks, especially for
processors with SIMD capabilities, just as it does in real
applications.

To understand the benchmark scores, note that some functions in the
BDTI Benchmark suite do not involve MACs at all--just as some
real-world signal processing application functions do not involve
MACs.  These functions include supervisory control code,
bit-manipulation code, and the Viterbi decoder algorithm.

Even on MAC-intensive benchmarks, the execution times are often longer
than a simple analysis of MAC throughput would suggest.  This occurs
for various reasons including architectural limitations, memory access
latencies, and overhead associated with entry and exit code.
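
The effect of entry and exit overhead on wide SIMD machines can be illustrated with a toy cycle model in Python. It is a sketch under stated assumptions, not BDTI's methodology: the fixed overhead of 50 cycles is a made-up figure, both machines are assumed to pay the same overhead, and the kernel is assumed to sustain its peak MAC rate with no memory stalls.

```python
import math

def cycles(n_macs, macs_per_cycle, overhead_cycles):
    # Fixed entry/exit cost plus an ideal inner loop at peak MAC rate.
    return overhead_cycles + math.ceil(n_macs / macs_per_cycle)

def speedup(n_macs, wide=8, narrow=2, overhead_cycles=50):
    # How much faster the wide machine finishes the same function.
    return (cycles(n_macs, narrow, overhead_cycles)
            / cycles(n_macs, wide, overhead_cycles))

for n in (64, 1024, 65536):
    print(f"{n:6d} MACs: 8-wide is {speedup(n):.2f}x faster than 2-wide")
```

For short functions the fixed overhead dominates and the 8-wide machine gains far less than its nominal 4x advantage; only long kernels approach the peak ratio.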

Taking all this into account, BDTI believes the 'TS20x BDTIsimMark2000
score accurately represents the performance of this processor in
typical 16-bit fixed-point DSP applications.

Note that the BDTImark2000/BDTIsimMark2000 only gives a first-order
picture of a processor's performance in signal processing tasks.  We
always recommend that processor users delve into more detailed
analysis when selecting a processor.  For example, obviously users
should pay close attention to individual benchmarks that resemble the
application workload, but give less weight to those benchmarks that
don't.

Further information about the BDTImark2000/BDTIsimMark2000 scores is
available at:

http://www.bdti.com/bdtimark/BDTImark2000.htm


Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
-----------------------------------------------------------------------
For free DSP industry news & analysis visit www.BDTI.com/dspinsider.htm
-----------------------------------------------------------------------
BDTI: Your source for independent DSP analysis & optimized DSP software
-----------------------------------------------------------------------
williston@bdti.com (Kenton Williston) wrote in message news:<9eaf8c13.0308131416.527c4d04@posting.google.com>...
Hi Kenton,

It's nice to hear from BDTI.

As I said, I don't think you are lying, but I'm surprised by the TIGERSharc score and am trying to understand it.

Everything you said makes sense, but can you be more specific about this DSP?
BlackFin and TMS320C64x are also SIMD processors; can you give us a piece of the test code to show us where the performance is lost?
Was the Communications Logic Unit (CLU) used for the benchmarking?
Is the test code handwritten in assembly?

One more question. Why didn't BDTI score the TIGERSharc floating point performance?

Luiz Carlos
oen_br@yahoo.com.br (Luiz Carlos) wrote in message news:<8471ba54.0308140241.5daa5ec1@posting.google.com>...
Luiz,

Thanks for your interest in the scores. I understand why it seems like the TS20x should have a higher score. For example, you can see that a 600 MHz TS20x can perform 4.8 billion 16-bit MACs per second, while a 720 MHz 'C64x can perform only 2.88 billion 16-bit MACs per second. However, the TS20x is only able to realize this level of performance on some of our benchmarks. On other benchmarks, the TS20x is actually slower than the 'C64x.

The reasons for this are well beyond what I can explain here. If you are interested in a detailed analysis of the TS20x performance, please contact Jeremy Giddings (our Director of Business Development) at giddings@BDTI.com or at +1 510 665 1600.

Note that our forthcoming report "Buyer's Guide to DSP Processors, 2004 Edition" will include details of the TS20x benchmarks and a thorough analysis of the results. This report will be published at the end of the year. You are welcome to order a copy now.

Let me answer your other questions:

- The CLU is used in our benchmarks, but not every feature of the CLU is exercised by our benchmarks. We can provide analysis of these non-benchmarked features as part of a custom analysis.

- Regarding the coding techniques, yes, our benchmarks are hand-coded in assembly.

- We hope to benchmark the TS20x using floating-point data in the near future. I encourage you to contact ADI and let them know your interest in floating-point BDTI Benchmark results.

Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
-----------------------------------------------------------------------
For free DSP industry news & analysis visit www.BDTI.com/dspinsider.htm
-----------------------------------------------------------------------
BDTI: Your source for independent DSP analysis & optimized DSP software
-----------------------------------------------------------------------
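
The peak MAC figures quoted in the reply above follow directly from clock rate times MACs per cycle, and can be set against the published scores with a few lines of Python. The clock rates and scores are the ones given in this thread; the MACs-per-cycle values (8 for the TS20x, 4 for the 'C64x) are inferred from the thread's own numbers, and "score per peak GMAC/s" is an ad hoc ratio introduced here only for illustration, not a BDTI metric.

```python
# Figures taken from this thread; macs_per_cycle is inferred from them.
chips = {
    "TS20x": {"mhz": 600, "macs_per_cycle": 8, "score": 6280},
    "C64x":  {"mhz": 720, "macs_per_cycle": 4, "score": 6480},
}

def peak_macs_per_s(chip):
    return chip["mhz"] * 1_000_000 * chip["macs_per_cycle"]

for name, chip in chips.items():
    peak = peak_macs_per_s(chip)
    per_gmac = chip["score"] / (peak / 1e9)
    print(f"{name}: {peak / 1e9:.2f} GMAC/s peak, "
          f"{per_gmac:.0f} score points per peak GMAC/s")
```

The gap between the two ratios is a compact way of restating the puzzle in this thread: the composite score tracks peak 16-bit MAC throughput only loosely.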