Tim,
I respond below to your points:
Tim Wescott <tim@wescottnospamdesign.com> writes:
> Randy Yates wrote:
>
> > Tim Wescott <tim@wescottnospamdesign.com> writes:
> >
>
> >>Randy Yates wrote:
> >>
> >>
> >>>Tim Wescott <tim@wescottnospamdesign.com> writes:
> >>>
> >>
> >>>>[...]
> >>>>Generally if you stick to pure C you are stuck with integer math.
> >>>>DSP's are designed to do fixed-radix math pretty quickly, ...
> >>>
> >>>Tim, I think most of your points are helpful, but this one is
> >>>off-the-mark
> >>
> >>>in my judgement. The typical fixed-point DSP operates much the same as
> >>>the C integer operations, performing integer math. Whether the
> >>>integers are reinterpreted to be fractional, fixed-point, or integer
> >>>is all in the interpretation and has little or nothing to do with the
> >>>implementation of the basic arithmetic operations (add, subtract,
> >>>multiply).
> >>>Of course there are differences between fixed-point DSP ALUs and the
> >>
> >>>"ALU" of a C compiler, the biggest of which are probably the wide
> >>>accumulators and the saturation options when performing various
> >>>operations. There is also the typical "left shift by 1" that a
> >>>fractional DSP does after a multiply to make the result fractional,
> >>>but that is certainly doable in C as well, albeit manually.
> >>
> >>The difference in clock ticks between implementing a fixed-point
> >>arbitrary-radix vector dot-product in assembly on a DSP and trying to
> >>do the same thing to the same precision in C on the same processor is
> >>on the order of 100:1.
> > Who said anything about a vector operation? Your statement was
>
> > Generally if you stick to pure C you are stuck with integer math.
>
> > DSP's are designed to do fixed-radix math pretty quickly, ...
> > The term "math" does not mean "vector math" in my interpretation.
>
> >
>
> >>Even on a MAC-less processor when you are in assembly and multiply two
> >>signed numbers N-bit numbers you can choose to take the lower N bits
> >>of the 2N-1-bit result as C does, or you can take the upper N-1 bits
> >>and do a shift, with way fewer clock cycles (10 or 20:1) than you
> >>could implement the same functionality in C. I should know -- I've
> >>done it in C a couple of times and in assembly on three or four
> >>different processors.
> > Apparently they did not include the TI TMS320C54x, arguably one of
>
> > the most popular DSPs around, and on that processor, the following
> > code
> > #include "dsptypes.h"
>
> > /* definitions */
>
> > #define VECTOR_LENGTH 64
>
> > /* local variables */
>
> > /* local function prototypes */
>
> > /* function definitions */
>
> > int main(int margc, char **margv) {
>
> > UINT16_T n;
> > INT16_T x[VECTOR_LENGTH];
> > INT16_T y[VECTOR_LENGTH];
> > INT32_T acc;
> > INT16_T result;
> > acc = 0;
>
> > for (n = 0; n < VECTOR_LENGTH; n++)
> > {
> > x[n] = n;
> > y[n] = VECTOR_LENGTH - n - 1;
> > }
> > acc = 0;
>
> > for (n = 0; n < VECTOR_LENGTH; n++)
> > {
> > acc += x[n] * y[n];
> > }
> > result = (INT16_T)(acc >> 16);
>
> > return result;
>
> > }
> > produces the following assembly language
>
> > 0000:0108 main
>
> > 0000:0108 4A11 PSHM 11h
> > 0000:0109 4A17 PSHM 17h
> > 0000:010A EE80 FRAME -128
> > 0000:010B E781 MVMM SP,AR1
> > 0000:010C 6DE9 MAR *+AR1(64)
> > 0000:010E E787 MVMM SP,AR7
> > 0000:010F E782 MVMM SP,AR2
> > 0000:0110 E800 LD #0h,A
> > 0000:0111 771A STM 3fh,1ah
> > 0000:0113 F072 RPTB 11ah
> > 0000:0115 L1
> > 0000:0115 8092 STL A,*AR2+
> > 0000:0116 E93F LD #3fh,B
> > 0000:0117 F520 SUB A,0,B
> > 0000:0118 8191 STL B,*AR1+
> > 0000:0119 F000 ADD #1h,0,A,A
> > 0000:011B L2
> > 0000:011B E782 MVMM SP,AR2
> > 0000:011C 6DEA MAR *+AR2(64)
> > 0000:011E E783 MVMM SP,AR3
> > 0000:011F E800 LD #0h,A
> > 0000:0120 EC3F RPT #3fh
> > 0000:0121 L3
> > 0000:0121 B089 MAC *AR2+,*AR3+,A,A
> > 0000:0122 L4
> > 0000:0122 F0E0 SFTL A,0,A
> > 0000:0123 F0F0 SFTL A,-16,A
> > 0000:0124 6BF8 ADDM 80h,*(18h)
> > 0000:0127 F495 NOP 0000:0128 F495 NOP 0000:0129 8A17
> > POPM 17h
>
> > 0000:012A 8A11 POPM 11h
> > 0000:012B F4E4 FRET Both the vector multiply and the end shift
> > look pretty damn efficient
>
> > to me, Tim. Thus even if we agree to interpret your point
> > differently, it's still
>
> > inaccurate for one of the most popular DSPs in the world.
>
> (A) That is the _only_ case that I know of for sure that the compiler
> can figure out it needs to use a MAC and shift -- the version of Code
> Composter that comes with the '2812 certainly doesn't do this, or I
> couldn't find the magic finger-ring combination.
By "case" do you mean "case of compiler version and architecture"?
Perhaps you are right. I didn't really "sneak" this combination on you
- I work with the 54x a LOT and it was the easiest thing for me to
try. And the vector multiply sample seems straightforward.
CEVA's Teak compiler produced 22 lines of assembly for the
multiply-accumulate step, so it is clearly inefficient. Due to
licensing and confidentiality issues, I'm not posting the result here.
However, the TI 5510 C compiler also produced efficient code:
;*******************************************************************************
;* TMS320C55x C/C++ Codegen Unix Version 2.40 *
;* Date/Time created: Wed Mar 9 09:07:28 2005 *
;*******************************************************************************
.mmregs
.cpl_on
.arms_on
.c54cm_off
.asg AR6, FP
.asg XAR6, XFP
.asg DPH, MDP
.model call=c55_std
.model mem=large
.noremark 5549 ; code avoids SE CPU_28
.noremark 5558 ; code avoids SE CPU_33
.noremark 5570 ; code avoids SE CPU_40
.noremark 5571 ; code avoids SE CPU_41
.noremark 5573 ; code avoids SE CPU_43
.noremark 5584 ; code avoids SE CPU_47
.noremark 5599 ; code avoids SE CPU_55
.noremark 5503 ; code avoids SE CPU_84 MMR write
.noremark 5505 ; code avoids SE CPU_84 MMR read
.noremark 5002 ; code respects overwrite rules
;*******************************************************************************
;* GLOBAL FILE PARAMETERS *
;* *
;* Architecture : TMS320C55x *
;* Optimization : Always Choose Smaller Code Size *
;* Memory : Large Model (23-Bit Data Pointers) *
;* Calls : Normal Library ASM calls *
;* Debug Info : Standard TI Debug Information *
;*******************************************************************************
.file "/home/unix/us057845/projects/simmac/ti/simmac.c"
; opt55 -m -O3 /var/tmp/aaaa0057b /var/tmp/daaa0057b -w /home/unix/us057845/projects/simmac/ti/dsp-5510/
.sect ".text"
.align 4
.global _main
.sym _main,_main, 36, 2, 0
.func 13
;----------------------------------------------------------------------
; 13 | int main(int margc, char **margv)
;----------------------------------------------------------------------
;*******************************************************************************
;* FUNCTION NAME: _main *
;* *
;* Function Uses Regs : AC0,AC0,AC1,AC1,T0,AR1,AR2,XAR2,AR3,XAR3,AR4,FP,XFP, *
;* SP,BRC0,CARRY,M40,SATA,SATD,RDM,FRCT,SMUL *
;* Save On Entry Regs : FP *
;* Stack Frame : Full (Frame Pointer in AR6, w/ debug) *
;* Total Frame Size : 131 words *
;* (2 return address/alignment) *
;* (1 frame pointer) *
;* (128 local values) *
;*******************************************************************************
;*******************************************************************************
;* *
;* Using -g (debug) with optimization (-o3) may disable key optimizations! *
;* *
;*******************************************************************************
_main:
.line 2
;----------------------------------------------------------------------
; 15 | UINT16_T n;
; 16 | INT16_T x[VECTOR_LENGTH];
; 17 | INT16_T y[VECTOR_LENGTH];
; 18 | INT32_T acc;
; 19 | INT16_T result;
; 21 | acc = 0;
;----------------------------------------------------------------------
;* T0 assigned to _margc
.sym _margc,12, 4, 17, 16
;* AR0 assigned to _margv
.sym _margv,17, 82, 17, 23
;* BRC0 assigned to L$1
;* BRC0 assigned to L$2
;* AR1 assigned to L$2
;* AR1 assigned to L$1
;* AR3 assigned to U$12
;* AR3 assigned to U$12
;* AR2 assigned to U$5
;* AR2 assigned to U$5
;* AR1 assigned to _n
.sym _n,18, 13, 4, 16
;* AC0 assigned to _acc
.sym _acc,0, 5, 4, 32
.sym _x,0, 51, 1, 1024,, 64
.sym _y,64, 51, 1, 1024,, 64
PSHBOTH XFP
ADD #-128, mmap(SP)
MOV XSP, XAR3
MOV XSP, XAR2
AMAR *+AR3(#64)
MOV XSP, XFP
.line 10
;----------------------------------------------------------------------
; 22 | for (n = 0; n < VECTOR_LENGTH; n++)
;----------------------------------------------------------------------
MOV #63, BRC0
RPTBLOCAL L2-1
|| MOV #0, AR1
; loop starts
L1:
.line 12
;----------------------------------------------------------------------
; 24 | x[n] = n;
;----------------------------------------------------------------------
MOV AR1, *AR2+ ; |24|
.line 13
;----------------------------------------------------------------------
; 25 | y[n] = VECTOR_LENGTH - n - 1;
;----------------------------------------------------------------------
MOV #63, AR4 ; |25|
SUB AR1, AR4 ; |25|
MOV AR4, *AR3+ ; |25|
.line 14
ADD #1, AR1 ; |26|
; loop ends ; |26|
L2:
MOV XSP, XAR3
MOV XSP, XAR2
AMAR *+AR3(#64)
.line 16
;----------------------------------------------------------------------
; 28 | acc = 0;
; 29 | for (n = 0; n < VECTOR_LENGTH; n++)
;----------------------------------------------------------------------
MOV #63, BRC0
RPTBLOCAL L4-1
|| MOV #0, AC0 ; |28|
; loop starts
L3:
.line 19
;----------------------------------------------------------------------
; 31 | acc += x[n] * y[n];
;----------------------------------------------------------------------
MPYM *AR3+, *AR2+, AC1 ; |31|
MOV mmap(AC1L), AC1 ; |31|
ADD AC1, AC0 ; |31|
.line 20
;----------------------------------------------------------------------
; 34 | result = (INT16_T)(acc >> 16);
;----------------------------------------------------------------------
; loop ends ; |32|
L4:
.line 24
;----------------------------------------------------------------------
; 36 | return result;
;----------------------------------------------------------------------
MOV HI(AC0), T0
.line 25
ADD #128, mmap(SP) ; |36|
POPBOTH XFP
RET ; |36|
; return occurs ; |36|
.endfunc 37,000000080h,129
;*******************************************************************************
;* TYPE INFORMATION *
;*******************************************************************************
.sym _INT16_T, 0, 3, 13, 16
.sym _UINT16_T, 0, 13, 13, 16
.sym _INT32_T, 0, 5, 13, 32
> (B) As usual TI is playing fast and loose with the ANSI standard, and
> that isn't even close to ANSI-compatible C. If it were the x[n] *
> y[n] operation would be truncated to 16 bits before being added to
> acc, and the result would be meaningless. Compile that up on machine
> that supports 16-bit and 32-bit integers, print out the results, and
> see what I mean.
According to ANSI specification ISO/IEC 9899-1999(E), the compiler is
not in violation. In a nutshell, the operation result is undefined and
so performing the operation in extended precision is acceptable. See
paragraph 2 of section 3.4.3 on "undefined behavior."
However, you're right in that what I wrote was a bit sloppy and
non-portable. The fix is very simple; change the line in question to:
acc += (INT32_T)x[n] * y[n];.
> Furthermore, you are actually making my point: by starting with an
> awareness of the one thing that sets a DSP apart from the rest and
> warping your code to fit that one thing you can make the operation
> very fast. But in production code you will have to be constantly on
> guard to make sure that the C code isn't "improved" in such a way that
> makes the compiler implement it as a bunch of "traditional" integer
> operations, thereby making it take 10-100 times slower, and likely
> reintroducing the truncation (more like 10, TI is good at making fast
> processors).
Well I'm not sure what your point is anymore. First you started out by
stating that one is stuck with integer math when using pure C. Then
you backed off and stated that your point was that vector multiplies
are inefficient. Now it seems like you're saying "they're inefficient
sometimes on some platforms." At least that seems to be the case.
The key point I wanted to make to the OP and to you is that it is
feasible to perform fixed-point arithmetic in C (whether on a DSP or
not), and that it even runs reasonably well on some platforms, i.e.
within 1000 percent of hand-tweaked assembly. (This definition of
reasonable is certainly debatable depending on the application's
requirements.)
A side issue that developed (i.e., via Jerry) seems to be whether or
not the fractional "left shift by one" is significant in the
determination to perform fixed-point processing in C. I say it is not
for the following reasons:
1. The "left shift by one" issue is not really just for "fractional"
processing but rather for any fixed-point multiply. The question is
generally whether the total dynamic range needs to be preserved or
the SNR is more important.
2. A requirement to left shift by one can be brought out of the
inner loop in MAC type of operations and thus be made negligible.
3. It is not always desirable to left shift by one and take the high
16 bits after multiplying two fixed-point 16-bit values into a
resulting 32-bit value. The N bits you take from M bits that result
from a fixed-point operation when N < M can be all over the map
depending on several application issues.
--
Randy Yates
Sony Ericsson Mobile Communications
Research Triangle Park, NC, USA
randy.yates@sonyericsson.com, 919-472-1124