Reply by Andrew Reilly February 27, 2004
On Tue, 05 Aug 2003 03:57:30 -0700, Luiz Carlos wrote:

> Has anyone seen the BDTI score for the TigerSHARC?
>
> TigerSHARC (600 MHz): 6280.
> BlackFin (600 MHz): 3360.
> TMS320C64x (720 MHz): 6480.
>
> As far as I know the TigerSHARC is almost four times faster than the
> BlackFin and two times faster than the C64x at 16 bit operations (for
> the same clock frequency). So, why these scores? What am I missing?
> Was 32 bit math used for scoring the TigerSHARC?

I'm not particularly familiar with any of these processors, but for any "real" code, of the sort that might be found in benchmarks, memory bandwidth and latency usually dominate execution unit performance. Are you saying that the TigerSHARC has access to two times wider/faster memory than the TI? (Maybe it does, I just don't know...)

-- 
Andrew
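The point about memory bandwidth can be illustrated with a rough bound: a streaming kernel can go no faster than its execution units allow, and no faster than memory can feed it operands. The sketch below is a minimal illustration in Python. The function name and all numbers are hypothetical and do not describe any of the processors discussed here.

```python
def attainable_macs_per_cycle(peak_macs_per_cycle, bytes_per_cycle, bytes_per_mac):
    """Upper bound on sustained MACs per cycle for a streaming kernel.

    The result is limited both by the execution units (peak_macs_per_cycle)
    and by how many operands memory can deliver each cycle.
    """
    memory_bound = bytes_per_cycle / bytes_per_mac
    return min(peak_macs_per_cycle, memory_bound)


# Hypothetical machine: 8 MACs/cycle peak, two 16-bit operands (4 bytes) per MAC.
for bytes_per_cycle in (8, 16, 32):
    rate = attainable_macs_per_cycle(8, bytes_per_cycle, 4)
    print(f"{bytes_per_cycle:2d} bytes/cycle -> at most {rate:g} MACs/cycle")
```

Under these assumptions the MAC units are only fully used once memory delivers 32 bytes per cycle. With narrower memory, the extra MAC units sit idle no matter how well the code is written.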
Reply by Kenton Williston August 17, 2003
an2or@mailcircuit.com (Andor) wrote in message news:<ce45f9ed.0308160037.20914f7e@posting.google.com>...
> Kenton Williston wrote:
> > Andor wrote:
> > > Luiz Carlos wrote:
> > > ...
> > > > One more question. Why didn't BDTI score the TIGERSharc floating point
> > > > performance?
> > >
> > > They scored the 21161N floating-point performance. As far as I
> > > remember, the 2116x and the TS have the same floating-point cores,
> > > just running at different clock rates. If their benchmark is to have
> > > any utility, it must be scalable with clock rate.
> ...
> > Andor,
> >
> > The architecture of the '2116x is significantly different than that of
> > the 'TS20x. Hence, it is not possible to use the '2116x score to
> > project a floating-point score for the 'TS20x.
>
> I know that the 2116x has dual independent (SIMD) 32/40bit
> floating-point units, each capable of a single cycle MAC instruction
> (that results in the 400 MFLOPS continuous score) and the
> multiply-add-subtract instruction (which results in the 600 MFLOPS peak
> score), apart from the usual single cycle
> multiply/add/subtract/min/max/average etc. instructions.
>
> Now from the data sheet of the TigerSHARC I gather it has the same
> floating-point core as the 2116x (dual 32/40 bit, single cycle MAC,
> single cycle multiply-add-subtract for each FPU). Which would mean
> that, at least 32/40bit floating-point wise, the two cores are equal.
> Please correct me, I am no expert on the TS, I just read the data
> sheet.
There are many similarities between the '2116x and the 'TS20x, but there are also many important differences between the two architectures. I cannot go into all the details here, but let me give two simple examples:

- The '2116x uses a three-stage pipeline while the 'TS20x uses a ten-stage pipeline.
- The '2116x has a maximum data bandwidth of 128 bits per cycle, while the 'TS20x has a maximum data bandwidth of 256 bits per cycle.

Due to these and other differences, it is not possible to use the '2116x score to project a floating-point score for the 'TS20x.
> > For your information, the TS201 is available at 500 MHz and 600 MHz.
>
> Yeah, it says so all over the ADI webpage. But if you go and read the
> latest data sheet for this processor (Rev. PrG, 6/03) and read the
> ordering guide, there is only one unit available, the
> ADSP-TS201SABP-ENG, with a nominal clock rate of 500 MHz.
>
> This wouldn't be the first time that announced clock rates weren't met
> with the real product.
>
> Regards,
> Andor

According to ADI, the TS201S is currently available at 600 MHz. I suggest you contact ADI for more information.

Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
-----------------------------------------------------------------------
For free DSP industry news & analysis visit www.BDTI.com/dspinsider.htm
-----------------------------------------------------------------------
BDTI: Your source for independent DSP analysis & optimized DSP software
-----------------------------------------------------------------------
Reply by Andor August 16, 2003
Kenton Williston wrote: 
> Andor wrote:
> > Luiz Carlos wrote:
> > ...
> > > One more question. Why didn't BDTI score the TIGERSharc floating point
> > > performance?
> >
> > They scored the 21161N floating-point performance. As far as I
> > remember, the 2116x and the TS have the same floating-point cores,
> > just running at different clock rates. If their benchmark is to have
> > any utility, it must be scalable with clock rate.
...
> Andor,
>
> The architecture of the '2116x is significantly different than that of
> the 'TS20x. Hence, it is not possible to use the '2116x score to
> project a floating-point score for the 'TS20x.
I know that the 2116x has dual independent (SIMD) 32/40 bit floating-point units, each capable of a single cycle MAC instruction (that results in the 400 MFLOPS continuous score) and the multiply-add-subtract instruction (which results in the 600 MFLOPS peak score), apart from the usual single cycle multiply/add/subtract/min/max/average etc. instructions.

Now from the data sheet of the TigerSHARC I gather it has the same floating-point core as the 2116x (dual 32/40 bit, single cycle MAC, single cycle multiply-add-subtract for each FPU). Which would mean that, at least 32/40 bit floating-point wise, the two cores are equal. Please correct me, I am no expert on the TS, I just read the data sheet.
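The 400 and 600 MFLOPS figures follow from simple arithmetic if one counts a MAC as two floating-point operations and a multiply-add-subtract as three. The sketch below assumes the 100 MHz clock quoted for the 2116x elsewhere in this thread; the function name is made up for illustration.

```python
def mflops(clock_mhz, units, flops_per_unit_per_cycle):
    """Peak MFLOPS when every unit issues the given operation on every cycle."""
    return clock_mhz * units * flops_per_unit_per_cycle


CLOCK_MHZ = 100  # assumption: the 100 MHz 2116x clock mentioned elsewhere in the thread
UNITS = 2        # dual (SIMD) floating-point units

print(mflops(CLOCK_MHZ, UNITS, 2))  # MAC = multiply + add
print(mflops(CLOCK_MHZ, UNITS, 3))  # multiply + add + subtract
```

These are peak issue rates only. They say nothing about how often real code can keep both units busy.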
> For your information, the TS201 is available at 500 MHz and 600 MHz.
Yeah, it says so all over the ADI webpage. But if you go and read the latest data sheet for this processor (Rev. PrG, 6/03) and read the ordering guide, there is only one unit available, the ADSP-TS201SABP-ENG, with a nominal clock rate of 500 MHz.

This wouldn't be the first time that announced clock rates weren't met with the real product.

Regards,
Andor
Reply by Kenton Williston August 15, 2003
an2or@mailcircuit.com (Andor) wrote in message news:<ce45f9ed.0308150025.692e447a@posting.google.com>...
> Luiz Carlos wrote:
> ...
> > One more question. Why didn't BDTI score the TIGERSharc floating point
> > performance?
>
> They scored the 21161N floating-point performance. As far as I
> remember, the 2116x and the TS have the same floating-point cores,
> just running at different clock rates. If their benchmark is to have
> any utility, it must be scalable with clock rate.
>
> From
>
> http://www.bdti.com/bdtimark/chip_scores.pdf
>
> one sees the score of the 2116x at 100 MHz is 510, so the TS-20x at
> 500 MHz (I don't see any version on any datasheet which runs at 600
> MHz) should score about 2550.
>
> Regards,
> Andor
Andor,

The architecture of the '2116x is significantly different than that of the 'TS20x. Hence, it is not possible to use the '2116x score to project a floating-point score for the 'TS20x.

For your information, the TS201 is available at 500 MHz and 600 MHz. The TS202 and TS203 are available at 500 MHz.

Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
Reply by Andor August 15, 2003
Luiz Carlos wrote:
...
> One more question. Why didn't BDTI score the TIGERSharc floating point
> performance?
They scored the 21161N floating-point performance. As far as I remember, the 2116x and the TS have the same floating-point cores, just running at different clock rates. If their benchmark is to have any utility, it must be scalable with clock rate.

From

http://www.bdti.com/bdtimark/chip_scores.pdf

one sees the score of the 2116x at 100 MHz is 510, so the TS-20x at 500 MHz (I don't see any version on any datasheet which runs at 600 MHz) should score about 2550.

Regards,
Andor
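The 2550 estimate is a straight linear scaling of the score with clock rate. As a minimal sketch (the function name is made up), the arithmetic is shown below. It only holds if the two cores really behave identically per cycle, which is exactly the assumption questioned in the replies.

```python
def scale_score_linearly(score, measured_mhz, target_mhz):
    """Project a benchmark score to another clock rate.

    Assumes the score is proportional to clock rate, i.e. an identical
    architecture and memory system at both speeds.
    """
    return score * target_mhz / measured_mhz


print(scale_score_linearly(510, 100, 500))  # 2550.0
```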
Reply by Kenton Williston August 14, 2003
oen_br@yahoo.com.br (Luiz Carlos) wrote in message news:<8471ba54.0308140241.5daa5ec1@posting.google.com>...
> Hi Kenton,
>
> It's nice to hear from BDTI.
>
> As I said, I don't think you are lying, but I'm surprised about and
> trying to understand the TIGERSharc score.
>
> Everything you said makes sense but, can you be more specific about
> this DSP?
> BlackFin and TMS320C64x are also SIMD processors, can you give us a
> piece of the test code to show us where the performance is lost?
> Was the Communications Logic Unit (CLU) used for the benchmarking?
> Is the test code handwritten in assembly?
>
> One more question. Why didn't BDTI score the TIGERSharc floating point
> performance?
>
> Luiz Carlos
Luiz,

Thanks for your interest in the scores. I understand why it seems like the TS20x should have a higher score. For example, you can see that a 600 MHz TS20x can perform 4.8 billion 16-bit MACs per second, while a 720 MHz 'C64x can perform only 2.88 billion 16-bit MACs per second. However, the TS20x is only able to realize this level of performance on some of our benchmarks. On other benchmarks, the TS20x is actually slower than the 'C64x.

The reasons for this are well beyond what I can explain here. If you are interested in a detailed analysis of the TS20x performance, please contact Jeremy Giddings (our Director of Business Development) at giddings@BDTI.com or at +1 510 665 1600.

Note that our forthcoming report "Buyer's Guide to DSP Processors, 2004 Edition" will include details of the TS20x benchmarks and a thorough analysis of the results. This report will be published at the end of the year. You are welcome to order a copy now.

Let me answer your other questions:

- The CLU is used in our benchmarks, but not every feature of the CLU is exercised by our benchmarks. We can provide analysis of these non-benchmarked features as part of a custom analysis.
- Regarding the coding techniques, yes, our benchmarks are hand-coded in assembly.
- We hope to benchmark the TS20x using floating-point data in the near future. I encourage you to contact ADI and let them know your interest in floating-point BDTI Benchmark results.

Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
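The peak MAC rates quoted above can be reproduced from clock rate times MACs per cycle, and set against the scores posted at the start of the thread. In this sketch the per-cycle MAC counts (8 and 4) are inferred from the quoted rates rather than stated in the post, and the "score per peak GMAC/s" ratio is an ad hoc illustration, not a BDTI metric.

```python
# Clock rates and scores are the figures quoted in this thread.
# macs_per_cycle values are inferred from the quoted MAC rates (assumption).
CHIPS = {
    "TS20x": {"clock_hz": 600e6, "macs_per_cycle": 8, "score": 6280},
    "C64x": {"clock_hz": 720e6, "macs_per_cycle": 4, "score": 6480},
}


def peak_macs_per_second(clock_hz, macs_per_cycle):
    """Peak rate: every MAC unit busy on every cycle."""
    return clock_hz * macs_per_cycle


for name, chip in CHIPS.items():
    peak = peak_macs_per_second(chip["clock_hz"], chip["macs_per_cycle"])
    points_per_gmac = chip["score"] / (peak / 1e9)
    print(f"{name}: {peak / 1e9:.2f} GMAC/s peak, "
          f"{points_per_gmac:.0f} score points per peak GMAC/s")
```

The ratio makes the gap visible: the chip with the higher peak MAC rate converts less of that peak into benchmark score, which is consistent with the statement that it reaches peak performance only on some benchmarks.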
Reply by Luiz Carlos August 14, 2003
williston@bdti.com (Kenton Williston) wrote in message news:<9eaf8c13.0308131416.527c4d04@posting.google.com>...
> BDTI is an independent company, and we are zealous about performing
> fair, objective benchmarking. At the same time, we work closely with
> processor vendors (including Analog Devices) during the benchmarking
> process to ensure that no legitimate opportunity for optimization of
> the benchmark code is missed.
>
> As some of you already pointed out, measuring a processor's
> signal-processing performance requires more than comparing MHz, MIPS,
> or number of MAC units. With the BDTI Benchmarks, our approach is to
> implement and thoroughly optimize a set of twelve benchmark functions
> representing common DSP tasks. BDTI ensures fair comparisons between
> processors by enforcing strict rules regarding the optimizations that
> are permitted, the amount of memory used, etc.
>
> Our benchmark functions include not only algorithm kernels but also
> all the required entry (setup) and exit (cleanup) code. In other
> words, the benchmarks are complete modules that could be used directly
> in real-world applications; they are not synthetic code fragments.
> The overhead associated with the entry and exit code becomes
> significant for some of the shorter benchmarks, especially for
> processors with SIMD capabilities, just as it does in real
> applications.
>
> To understand the benchmark scores, note that some functions in the
> BDTI Benchmark suite do not involve MACs at all--just as some
> real-world signal processing application functions do not involve
> MACs. These functions include supervisory control code,
> bit-manipulation code, and the Viterbi decoder algorithm.
>
> Even on MAC-intensive benchmarks, the execution times are often longer
> than a simple analysis of MAC throughput would suggest. This occurs
> for various reasons including architectural limitations, memory access
> latencies, and overhead associated with entry and exit code.
>
> Taking all this into account, BDTI believes the 'TS20x BDTIsimMark2000
> score accurately represents the performance of this processor in
> typical 16-bit fixed-point DSP applications.
>
> Note that the BDTImark2000/BDTIsimMark2000 only gives a first-order
> picture of a processor's performance in signal processing tasks. We
> always recommend that processor users delve into more detailed
> analysis when selecting a processor. For example, obviously users
> should pay close attention to individual benchmarks that resemble the
> application workload, but give less weight to those benchmarks that
> don't.
>
> Further information about the BDTImark2000/BDTIsimMark2000 scores is
> available at:
>
> http://www.bdti.com/bdtimark/BDTImark2000.htm
>
> Best Regards,
>
> Kenton Williston
> DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
> williston@BDTI.com                                  http://www.BDTI.com
> Phone: +1 510-665-1600                             Fax: +1 510-665-1680
Hi Kenton,

It's nice to hear from BDTI.

As I said, I don't think you are lying, but I'm surprised by the TIGERSharc score and am trying to understand it.

Everything you said makes sense, but can you be more specific about this DSP?
BlackFin and TMS320C64x are also SIMD processors; can you give us a piece of the test code to show us where the performance is lost?
Was the Communications Logic Unit (CLU) used for the benchmarking?
Is the test code handwritten in assembly?

One more question. Why didn't BDTI score the TIGERSharc floating point performance?

Luiz Carlos
Reply by Kenton Williston August 13, 2003
BDTI is an independent company, and we are zealous about performing
fair, objective benchmarking.  At the same time, we work closely with
processor vendors (including Analog Devices) during the benchmarking
process to ensure that no legitimate opportunity for optimization of
the benchmark code is missed.

As some of you already pointed out, measuring a processor's
signal-processing performance requires more than comparing MHz, MIPS,
or number of MAC units.  With the BDTI Benchmarks, our approach is to
implement and thoroughly optimize a set of twelve benchmark functions
representing common DSP tasks.  BDTI ensures fair comparisons between
processors by enforcing strict rules regarding the optimizations that
are permitted, the amount of memory used, etc.

Our benchmark functions include not only algorithm kernels but also
all the required entry (setup) and exit (cleanup) code. In other
words, the benchmarks are complete modules that could be used directly
in real-world applications; they are not synthetic code fragments.
The overhead associated with the entry and exit code becomes
significant for some of the shorter benchmarks, especially for
processors with SIMD capabilities, just as it does in real
applications.

To understand the benchmark scores, note that some functions in the
BDTI Benchmark suite do not involve MACs at all--just as some
real-world signal processing application functions do not involve
MACs.  These functions include supervisory control code,
bit-manipulation code, and the Viterbi decoder algorithm.

Even on MAC-intensive benchmarks, the execution times are often longer
than a simple analysis of MAC throughput would suggest.  This occurs
for various reasons including architectural limitations, memory access
latencies, and overhead associated with entry and exit code.
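
A rough way to see why entry and exit overhead hurts wide SIMD machines most on short benchmarks is to model the cycle count as a fixed overhead plus the MAC work divided by the MACs issued per cycle. The sketch below uses made-up numbers (20 overhead cycles, 1 versus 8 MACs per cycle) and is not derived from the actual BDTI benchmark code.

```python
import math


def benchmark_cycles(n_macs, macs_per_cycle, overhead_cycles):
    """Cycle count: fixed entry/exit overhead plus the MAC-bound kernel."""
    return overhead_cycles + math.ceil(n_macs / macs_per_cycle)


def simd_speedup(n_macs, wide, narrow, overhead_cycles):
    """Speedup of a 'wide' machine over a 'narrow' one with equal overhead."""
    return (benchmark_cycles(n_macs, narrow, overhead_cycles)
            / benchmark_cycles(n_macs, wide, overhead_cycles))


# Hypothetical: 20 cycles of setup/cleanup, 1 vs 8 MACs per cycle.
for n_macs in (64, 512, 4096):
    print(n_macs, round(simd_speedup(n_macs, wide=8, narrow=1, overhead_cycles=20), 2))
```

With these numbers the eight-wide machine is only three times faster on the 64-MAC case, and approaches its nominal eightfold advantage only as the kernel grows.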

Taking all this into account, BDTI believes the 'TS20x BDTIsimMark2000
score accurately represents the performance of this processor in
typical 16-bit fixed-point DSP applications.

Note that the BDTImark2000/BDTIsimMark2000 only gives a first-order
picture of a processor's performance in signal processing tasks.  We
always recommend that processor users delve into more detailed
analysis when selecting a processor.  For example, obviously users
should pay close attention to individual benchmarks that resemble the
application workload, but give less weight to those benchmarks that
don't.

Further information about the BDTImark2000/BDTIsimMark2000 scores is
available at:

http://www.bdti.com/bdtimark/BDTImark2000.htm


Best Regards,

Kenton Williston
DSP Analyst                    BDTI -- Berkeley Design Technology, Inc.
williston@BDTI.com                                  http://www.BDTI.com
Phone: +1 510-665-1600                             Fax: +1 510-665-1680
Reply by Luiz Carlos August 8, 2003
"Ron Huizen" <rhuizen@bittware.com> 
> Actually, we (BittWare) have long felt that many of the benchmarks (or
> benchmarketing as we like to call them) used in the world of DSP don't take
> some of these issues (like memory bandwidth) into account enough. They
> typically measure the time to run a specific routine, like an FFT or FIR,
> but don't take into account the time to get the data into and out of the
> core. Since many chip vendors use a 1024pt Complex FFT as a benchmark,
> we've been trying to use a measure of "Continuous FFTs" to indicate how many
> FFTs a processor (or board in our case) can do per second, assuming that you
> have to get new data into the DSP and the results out, for each FFT. We're
> also trying to use the term "Bandwidth to Processing Ratio" (BPR) to give an
> indication of how "balanced" an architecture is.
>
> It's pretty interesting to see how, with the latest generation of floating
> point DSP processors, like the TigerSharcs and G4 PowerPCs, how much the
> memory bandwidth issues come into account as opposed to number crunching
> issues. Of course, since we sell boards based on Sharcs, and the TigerSharc
> has a very nice bandwidth to processing ratio, and hence a high continuous
> FFTs per second, we like to point this stuff out :-) But seriously, in many
> applications, the data movement issues can dominate the system.
>
> Have a look at
> ftp://ftp.bittware.com/documents/Articles/300MHz_TS_vs_PPC.pdf for a paper
> talking about this.
>
> As for why the Tiger scores are lower than expected, I don't have enough
> details of the actual BDTI algorithms to see. Maybe the
> coding/implementation wasn't optimal, or perhaps, as someone else mentioned,
> they just did 32 bit math and didn't get into making use of the 16 bit
> capabilities of the Tiger. I wonder if anyone from BDTI watches this group?
>
> This thread has led me to ask some friends at ADI to see what's up with the
> BDTI scores for Tiger. If I get any relevant info back, I will post it.
>
> ----
> Ron Huizen
> BittWare
Hi Ron,

Interesting article.

As BittWare has a long relationship with ADI, maybe you can send a suggestion to them. You know that when coding the ALU instructions we use some modifiers to designate the size of the operands:

B = byte (8 bit);
S = short (16 bit);
nothing = default size (32 bit);
L = long (64 bit).

But this is not used for the multiplier operations.

R0:1 = R2*R3;;
puts the 64 bit result of R2 (32 bit) multiplied by R3 (32 bit) in the register pair R1/R0. OK, no problem here, but...

R0:3 = R4:5*R6:7;;
could have four different meanings:

a) LR0:3 = R4:5*R6:7;; one 64 bit multiplication with a 128 bit result;
b) R0:3 = R4:5*R6:7;; two 32 bit multiplications with 64 bit results;
c) SR0:3 = R4:5*R6:7;; four 16 bit multiplications with 32 bit results, which is what is actually implemented;
d) BR0:3 = R4:5*R6:7;; eight 8 bit multiplications with 16 bit results.

The TIGERSharc can only do option c), but I think it would be clearer to the reader if the "S" modifier were required. Maybe in a future release we will have the other options.

Before someone asks me why I said the TIGERSharc is four times faster than the BlackFin: the instruction

XYMR3:0 += R0:1*R2:3;;

does eight 16 bit MACs in one cycle, four in the "X" unit and four in the "Y" unit. I think we can call it a nested SIMD architecture.

Luiz Carlos.
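To make option c) concrete, here is a small Python emulation of a four-lane 16 bit multiply on 64 bit operands. It is only a sketch of the idea: the helper names are made up, and the lane order (lowest 16 bits first) is an assumption for illustration, not the documented TigerSHARC register layout.

```python
import struct


def pack_lanes16(lanes):
    """Pack four signed 16-bit values into one 64-bit integer, lowest lane first."""
    return int.from_bytes(struct.pack("<4h", *lanes), "little")


def unpack_lanes16(word64):
    """Split a 64-bit integer into four signed 16-bit lanes, lowest lane first."""
    return list(struct.unpack("<4h", word64.to_bytes(8, "little")))


def simd_mul16(a64, b64):
    """Lane-wise 16x16 multiply: four independent products, each fitting in 32 bits."""
    return [x * y for x, y in zip(unpack_lanes16(a64), unpack_lanes16(b64))]


a = pack_lanes16([1, -2, 3, -4])
b = pack_lanes16([5, 6, -7, -8])
print(simd_mul16(a, b))  # [5, -12, -21, 32]
```

Each pair of 64 bit operands yields four products, so two compute units working in parallel would give the eight 16 bit operations per cycle mentioned above.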
Reply by Ron Huizen August 7, 2003
Andrew Reilly <andrew@gurney.reilly.home> wrote in message
news:pan.2003.08.06.01.16.02.259307@gurney.reilly.home...
> On Tue, 05 Aug 2003 03:57:30 -0700, Luiz Carlos wrote:
>
> > Has anyone seen the BDTI score for the TigerSHARC?
> >
> > TigerSHARC (600 MHz): 6280.
> > BlackFin (600 MHz): 3360.
> > TMS320C64x (720 MHz): 6480.
> >
> > As far as I know the TigerSHARC is almost four times faster than the
> > BlackFin and two times faster than the C64x at 16 bit operations (for
> > the same clock frequency). So, why these scores? What am I missing?
> > Was 32 bit math used for scoring the TigerSHARC?
>
> I'm not particularly familiar with any of these processors, but for any
> "real" code, of the sort that might be found in benchmarks, memory
> bandwidth and latency usually dominate execution unit performance. Are
Actually, we (BittWare) have long felt that many of the benchmarks (or benchmarketing as we like to call them) used in the world of DSP don't take some of these issues (like memory bandwidth) into account enough. They typically measure the time to run a specific routine, like an FFT or FIR, but don't take into account the time to get the data into and out of the core. Since many chip vendors use a 1024pt Complex FFT as a benchmark, we've been trying to use a measure of "Continuous FFTs" to indicate how many FFTs a processor (or board in our case) can do per second, assuming that you have to get new data into the DSP and the results out, for each FFT. We're also trying to use the term "Bandwidth to Processing Ratio" (BPR) to give an indication of how "balanced" an architecture is.

It's pretty interesting to see, with the latest generation of floating point DSP processors, like the TigerSharcs and G4 PowerPCs, how much the memory bandwidth issues come into play as opposed to number crunching issues. Of course, since we sell boards based on Sharcs, and the TigerSharc has a very nice bandwidth to processing ratio, and hence a high continuous FFTs per second, we like to point this stuff out :-) But seriously, in many applications, the data movement issues can dominate the system.

Have a look at ftp://ftp.bittware.com/documents/Articles/300MHz_TS_vs_PPC.pdf for a paper talking about this.

As for why the Tiger scores are lower than expected, I don't have enough details of the actual BDTI algorithms to see. Maybe the coding/implementation wasn't optimal, or perhaps, as someone else mentioned, they just did 32 bit math and didn't get into making use of the 16 bit capabilities of the Tiger. I wonder if anyone from BDTI watches this group?

This thread has led me to ask some friends at ADI to see what's up with the BDTI scores for Tiger. If I get any relevant info back, I will post it.

----
Ron Huizen
BittWare
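
The "Continuous FFTs" idea can be sketched as follows: the sustained rate is set by whichever is slower, computing one FFT or moving one block of data in and one block of results out. The model, the function name, and every number below are assumptions for illustration only. They are not BittWare's definition and not measurements of any processor.

```python
def continuous_ffts_per_second(compute_s, io_bytes, io_bytes_per_s, overlapped=True):
    """Sustained FFT rate when each FFT needs fresh input and must ship its output.

    overlapped=True models data movement running in parallel with computation
    (for example via DMA); otherwise transfer and compute times simply add.
    """
    io_s = io_bytes / io_bytes_per_s
    period_s = max(compute_s, io_s) if overlapped else compute_s + io_s
    return 1.0 / period_s


N_POINTS = 1024
BYTES_PER_POINT = 8                        # complex sample as two 32-bit floats
IO_BYTES = 2 * N_POINTS * BYTES_PER_POINT  # one block in, one block out

COMPUTE_S = 20e-6        # hypothetical kernel time for one FFT
IO_BYTES_PER_S = 200e6   # hypothetical external bandwidth

print(f"kernel only:    {1.0 / COMPUTE_S:9.0f} FFTs/s")
print(f"with I/O (DMA): {continuous_ffts_per_second(COMPUTE_S, IO_BYTES, IO_BYTES_PER_S):9.0f} FFTs/s")
print(f"with I/O (seq): {continuous_ffts_per_second(COMPUTE_S, IO_BYTES, IO_BYTES_PER_S, overlapped=False):9.0f} FFTs/s")
```

In this made-up case the kernel-only figure overstates the sustained rate by about a factor of four, because moving 16 KB per FFT takes longer than the FFT itself.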