Reply by Richard Williams●October 10, 2010
Jeff,
For the referenced code, I can make the following comments.
1) The int variables sig_loop_lower_bound and sig_loop_upper_bound are on the
stack and are never set to any specific value, so no _nassert() tests should
be run against them.
*I* would eliminate these variables and all _nassert() references to them.
2) The statements:
    _nassert((int)h % 8 == 0);
    _nassert((int)x % 8 == 0);
    _nassert((int)y % 8 == 0);
should be moved outside, before the outermost loop. The array addresses do
not change, so they only need to be asserted once.
3) To ensure that the h, x, and y arrays are all properly aligned on the
8-byte (64-bit) boundary that the "% 8 == 0" tests above assert, a bit of work
needs to be done in the .cmd linker file and the appropriate lines added to
the source, similar to:
    #pragma DATA_SECTION(x, "xStartAddressMem")
    short int x[X_LEN];
    #pragma DATA_SECTION(h, "hStartAddressMem")
    short int h[H_LEN];
    #pragma DATA_SECTION(y, "yStartAddressMem")
    short int y[H_LEN+X_LEN];
4) The index calculation within the inner loop:
    x[i + j]
is highly undesirable. It should be replaced by pointers that are set up
outside the inner loop, which also eliminates 'j', similar to the following:
    short int * pX;
    short int * pH;
    short int * pHmax = &(h[h_len]);
    short int * pY = y;
    .
    .
    .
    sum = 0;
    for (pH = h, pX = &x[i];    // pointer initialization
         pH < pHmax;            // pointer comparison
         pH++, pX++)            // pointer increment
    {
        sum += ((*pH) * (*pX)); // where the DDOTPL2 instruction would be used
    } // end for()
    *pY = (short int)(sum >> 15);
    pY++;
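A minimal, compilable sketch of the pointer-walking loop from point 4, assuming 16-bit Q15 samples and a 32-bit int accumulator. The function name conv_ptr and the out_len parameter are hypothetical and are not taken from conv_cim.c. The caller is assumed to supply at least out_len + h_len - 1 samples in x so the inner loop never reads past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original source): Q15 FIR that
 * walks pointers instead of indexing x[i + j].
 * Precondition: x holds at least (out_len + h_len - 1) samples. */
void conv_ptr(const short *x, const short *h, short *y,
              size_t out_len, size_t h_len)
{
    const short *pHmax = h + h_len;  /* one past the last filter tap */
    short *pY = y;
    size_t i;

    for (i = 0; i < out_len; i++) {
        const short *pH = h;         /* restart at the first tap */
        const short *pX = x + i;     /* window start for this output */
        int sum = 0;

        for (; pH < pHmax; pH++, pX++) {
            sum += (*pH) * (*pX);    /* 16x16 multiply-accumulate */
        }
        *pY++ = (short)(sum >> 15);  /* scale the Q30 sum back to Q15 */
    }
}
```

Whether this form changes the instructions the C64x+ compiler selects is not established by the sketch; it only shows the loop structure being proposed.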
R. Williams
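For point 3, the TI compiler also provides a DATA_ALIGN pragma that requests alignment directly in the source, which may be simpler than arranging it only through the linker command file. The sketch below is an illustration under assumptions: the lengths X_LEN = 256 and H_LEN = 16 are taken from the array sizes discussed in the thread, the names ALIGN8 and is_dword_aligned are hypothetical, and the non-TI branches exist only so the example builds on a host compiler.

```c
#include <assert.h>
#include <stdint.h>

#define X_LEN 256   /* assumed from the thread: x holds 256 samples */
#define H_LEN 16    /* assumed from the thread: 16 filter taps      */

#if defined(_TMS320C6X)
/* TI C6000 compiler: request 8-byte alignment for each array. */
#pragma DATA_ALIGN(x, 8)
#pragma DATA_ALIGN(h, 8)
#pragma DATA_ALIGN(y, 8)
#define ALIGN8
#elif defined(__GNUC__)
/* Host fallback for GCC/Clang. */
#define ALIGN8 __attribute__((aligned(8)))
#else
/* Host fallback for other C11 compilers. */
#define ALIGN8 _Alignas(8)
#endif

ALIGN8 short x[X_LEN];
ALIGN8 short h[H_LEN];
ALIGN8 short y[H_LEN + X_LEN];

/* Hypothetical helper: returns 1 when p sits on an 8-byte boundary,
 * which is the condition "(int)p % 8 == 0" passed to _nassert(). */
int is_dword_aligned(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}
```

A runtime check like is_dword_aligned() can be useful in a debug build, because _nassert() itself generates no code and a false assertion would go unnoticed.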
---------- Original Message -----------
From: "Jeff Brower"
To: "Richard Williams"
Cc: c...
Sent: Fri, 8 Oct 2010 18:38:56 -0500 (CDT)
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
> Richard-
>
> Here is the convolution source (conv_cim.c):
>
> http://groups.yahoo.com/group/c6x/attachments/folder/1393519143/item/list
>
> Evidently Yahoo groups has a script that strips off attachments and
> stores them in a central place. But, one thing we
>
> -Jeff
>
_____________________________________
Reply by Jeff Brower●October 8, 2010
All-
I just got told the "less-than asterisk greater-than" thing killed some text in
my last post. That thing is vicious.
Here is the text again... I used (*) instead.
Evidently Yahoo groups has a script that strips off attachments and stores them
in a central place. But, one thing we noticed is that the script adds some
references to the attachment prefixed by (*) and, unfortunately, some HTML mail
clients see this as a tag and kill the references. We found this because we can
only see references to attachments in our text-based mail clients.
-Jeff
------------------ Original Message ----------------
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
From: "Jeff Brower"
Date: Fri, October 8, 2010 6:38 pm
To: "Richard Williams"
Cc: c...
----------------
Evidently Yahoo groups has a script that strips off attachments and stores them
in a central place. But, one thing we
-Jeff
_____________________________________
Reply by Vikram Ragukumar●October 7, 2010
Jagadeesh,
> The compiler requires the use of intrinsics to match the C64x+ _ddotpl2
> instruction; please refer to the c6x.h include file for the expected
> prototype.
>
> I have been corresponding with Jeff directly. Starting from natural C code,
> which is their project's requirement, the C code can be annotated with
> _nasserts to give the compiler the information it needs to unroll the loop,
> for example as follows:
After incorporating #pragma MUST_ITERATE and _nassert() as you
suggested, we are seeing comparable timing figures in the generated
assembly output.
Also, we now have a better understanding of how to interpret the cycle
count information generated by the compiler.
Thanks and Regards,
Vikram.
>
> /*----------------------------------------------------------------*/
> /* The following assumptions are made if NOASSUME is not defined. */
> /* It is assumed that the number of output samples >= 2. It is    */
> /* also assumed that the number of output samples to be computed  */
> /* is a multiple of 2. In addition it is assumed that the number  */
> /* of filter taps is >= 4, and a multiple of 4.                   */
> /*----------------------------------------------------------------*/
>
> #ifndef NOASSUME
> _nassert(nr >= 2);
> _nassert(nr % 2 == 0);
> _nassert(nh >= 4);
> _nassert(nh % 4 == 0);
> #pragma MUST_ITERATE(4,,4);
> #endif
>
> for (j = 0; j < nr; j++)
> {
>     /*------------------------------------------------------------*/
>     /* Initialize accumulator for FIR sum to be zero.             */
>     /*------------------------------------------------------------*/
>
>     sum = 0;
>
>     /*------------------------------------------------------------*/
>     /* The following assumptions are made if NOASSUME is not      */
>     /* defined. It is assumed that the input, filter and output   */
>     /* pointers are dword aligned. In addition it is assumed that */
>     /* the # of filter taps is at least 8, and a multiple of 8.   */
>     /*------------------------------------------------------------*/
>
>     #ifndef NOASSUME
>     _nassert((int)x % 8 == 0);
>     _nassert((int)h % 8 == 0);
>     _nassert((int)r % 8 == 0);
>     #pragma MUST_ITERATE(8,,8);
>     #endif
>
>     /*------------------------------------------------------------*/
>     /* Compute FIR as sum of products.                            */
>     /*------------------------------------------------------------*/
>
>     for (i = 0; i < nh; i++)
>     {
>         sum += x[i + j] * h[i];
>     }
>
>     /*------------------------------------------------------------*/
>     /* Shift out FIR sum and store out.                           */
>     /*------------------------------------------------------------*/
>
>     r[j] = sum >> 15;
> }
> }
>
> When we compile with cl6x -k -o3 -mwth -mv6400 or -mv6400+ we see 16 DOTP2s
> in 9 cycles. Each DOTP2 performs 2 multiplies, so that is 32 multiplies in
> 9 cycles, or about 3.56 multiplies per cycle, achieved by annotating
> natural C code.
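The quoted fragment omits the function header, so here is a self-contained sketch of the same annotated loop that also builds off-target. Several things are assumptions rather than part of the quoted message: the function name fir_natural_c, the parameter order, the no-op fallback for _nassert on non-TI compilers, and the choice to make each MUST_ITERATE pragma agree with the _nassert lines directly above it (2 for the output loop, 8 for the tap loop).

```c
#include <assert.h>

#ifndef _TMS320C6X
/* Off-target fallback (assumption): _nassert is a TI compiler
 * intrinsic that generates no code, so a no-op is equivalent here. */
#define _nassert(cond) ((void)0)
#endif

/* Hypothetical wrapper around the annotated loop.
 * Contract: nr >= 2 and even, nh >= 8 and a multiple of 8,
 * x holds at least nr + nh - 1 samples, and on the TI target
 * x, h and r are all 8-byte aligned. */
void fir_natural_c(const short *x, const short *h, short *r,
                   int nh, int nr)
{
    int i, j, sum;

    _nassert(nr >= 2);
    _nassert(nr % 2 == 0);
    _nassert(nh >= 8);
    _nassert(nh % 8 == 0);
#ifdef _TMS320C6X
#pragma MUST_ITERATE(2,,2)
#endif
    for (j = 0; j < nr; j++) {
        sum = 0;

        _nassert((int)x % 8 == 0);
        _nassert((int)h % 8 == 0);
        _nassert((int)r % 8 == 0);
#ifdef _TMS320C6X
#pragma MUST_ITERATE(8,,8)
#endif
        for (i = 0; i < nh; i++) {
            sum += x[i + j] * h[i];   /* FIR sum of products */
        }

        r[j] = (short)(sum >> 15);    /* shift out the Q15 result */
    }
}
```

On a host compiler the annotations vanish and only the plain loops remain, which makes it easy to check results there before reading the software-pipelining report from the TI compiler.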
>
> C6x Compiler generated assembly:
>
> ;** --*
> $C$L4:    ; PIPED LOOP KERNEL
> ;          EXCLUSIVE CPU CYCLES: 9
>
>            ADD     .L1     A4,A21,A21        ; |95|<0,10>
> ||         DOTP2   .M2X    B7,A3,B6          ; |95|<0,10>
> ||         DOTP2   .M1X    B6,A3,A4          ; |95|<0,10>
> ||         LDW     .D2T1   *B23++,A3         ; |95|<1,1>
>
>            ADD     .L1     A4,A19,A19        ; |95|<0,11>
> ||         ADD     .L2     B8,B19,B19        ; |95|<0,11>
> ||         DOTP2   .M1X    B6,A3,A4          ; |95|<0,11>
> ||         LDNDW   .D1T2   *+A8(12),B7:B6    ; |95|<1,2>
>
>            DOTP2   .M2X    B16,A3,B8         ; |95|<0,12>
> ||         ADD     .L2     B6,B18,B18        ; |95|<0,12>
> ||         DOTP2   .M1X    B9,A3,A4          ; |95|<0,12>
> ||         ADD     .L1     A4,A9,A9          ; |95|<0,12>
> ||         LDNDW   .D1T1   *+A8(14),A5:A4    ; |95|<1,3>
>
>    [ B0]   BDEC    .S2     $C$L4,B0          ; |95|<0,13>
> ||         DOTP2   .M2X    B8,A3,B7          ; |95|<0,13>
> ||         DOTP2   .M1X    B7,A3,A22         ; |95|<0,13>
> ||         ADD     .L1     A4,A16,A16        ; |95|<0,13>
> ||         ADD     .L2     B6,B21,B21        ; |95|<0,13>
> ||         LDNDW   .D1T2   *+A8(20),B9:B8    ; |95|<1,4>
>
>            DOTP2   .M1X    B17,A3,A4         ; |95|<0,14>
> ||         DOTP2   .M2X    B9,A3,B6          ; |95|<0,14>
> ||         ADD     .L1     A4,A7,A7          ; |95|<0,14>
> ||         ADD     .L2     B6,B22,B22        ; |95|<0,14>
> ||         LDNDW   .D1T2   *+A8(22),B7:B6    ; |95|<1,5>
>
>            ADD     .L1     A4,A17,A17        ; |95|<0,15>
> ||         DOTP2   .M1     A23,A3,A4         ; |95|<1,6>
> ||         LDNDW   .D1T2   *+A8(6),B7:B6     ; |95|<1,6>
>
>            ADD     .L2     B8,B4,B4          ; |95|<0,16>
> ||         ADD     .L1     A4,A6,A6          ; |95|<0,16>
> ||         LDNDW   .D1T2   *-A8(2),B17:B16   ; |95|<1,7>
> ||         DOTP2   .M2X    B7,A3,B8          ; |95|<1,7>
> ||         DOTP2   .M1     A22,A3,A4         ; |95|<1,7>
>
>            ADD     .L2     B7,B5,B5          ; |95|<0,17>
> ||         ADD     .L1     A22,A18,A18       ; |95|<0,17>
> ||         DOTP2   .M2X    B6,A3,B6          ; |95|<1,8>
> ||         LDNDW   .D1T2   *+A8(4),B9:B8     ; |95|<1,8>
> ||         DOTP2   .M1     A4,A3,A4          ; |95|<1,8>
>
>            ADD     .L1     A4,A20,A20        ; |95|<0,18>
> ||         ADD     .L2     B6,B20,B20        ; |95|<0,18>
> ||         DOTP2   .M1     A5,A3,A4          ; |95|<1,9>
> ||         DOTP2   .M2X    B8,A3,B6          ; |95|<1,9>
> ||         LDNDW   .D1T1   *A8++(4),A23:A22  ; |95|<2,0>
>
> Regards
> JS
>
> ________________________________
> From: c... [mailto:c...] On Behalf Of Bhooshan Iyer
> Sent: Thursday, October 07, 2010 6:45 AM
> To: Jeff Brower
> Cc: Laurent Gauthier; c...
> Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>
> Since this is a newer instruction targeting C64x+ cores, you should also
> refer 64x to 64x+ software migration documents like these for clues on how
> and whether you can get the compiler to generate code using DDOTPL2.
>
> http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf
>
> Apart from this I believe TI also used to employ register TDM on their
> benchmarks as opposed to loop-unrolling done by the compiler. My
> understanding of the compiler of those days was that the behaviour was not
> yet available thru the C compiler.
>
> It's possible TI may release an under-development compiler to you with these
> extensions, if they exist at all. Worth a shot.
>
> --Bhooshan
> On Thu, Oct 7, 2010 at 4:36 PM, Bhooshan Iyer wrote:
> Jeff--
> I suggest you write to JS @ TI to see if he has a better compiler oriented
> approach for you. I know of certain cases where the hand-optimization cannot
> be matched by the compiler but for most typical loops, compiler optimization
> should take you fairly close to where you want to go. JS has always been a
> big champion of letting the compiler work for you.
>
> http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf
>
> http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf
>
> https://www.cosic.esat.kuleuven.be/publications/article-674.pdf
>
> --Bhooshan
_____________________________________
Reply by Jeff Brower●October 7, 2010
Richard-
> The line:
> sum += h[j] * x[i + j];
> contains: x[i + j] where the max size of the x array is 256
> where i ranges from 0 to 255
> and j ranges from 0 to 15
>
> so (as an example) when i=255 and j=1
> then the referenced address is beyond the end of the x array bounds.
>
> regarding the optimization...
>
> my first action would be to declare h_len and x_len as 'register' so no CPU
> cycles are wasted accessing the values h_len and x_len on the stack.
>
> my second action would be to declare i and j as 'register' so no CPU cycles
> are wasted accessing the values on the stack.
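The overrun described in the quoted text can be checked with a little index arithmetic. The sketch below is illustrative only: the helper names max_x_index and safe_output_count are hypothetical, and it assumes the loop shape quoted above, where the outer index i and the tap index j are added to form x[i + j].

```c
#include <assert.h>
#include <stddef.h>

/* Largest index of x[] touched by
 *   for (i = 0; i < n_out; i++) for (j = 0; j < h_len; j++) x[i + j]
 * Precondition: n_out >= 1 and h_len >= 1. */
size_t max_x_index(size_t n_out, size_t h_len)
{
    return (n_out - 1) + (h_len - 1);
}

/* Number of outputs that can be computed without reading past
 * x[x_len - 1]; 0 when the filter is longer than the data. */
size_t safe_output_count(size_t x_len, size_t h_len)
{
    return (h_len > 0 && x_len >= h_len) ? x_len - h_len + 1 : 0;
}
```

With the sizes in the quoted message (256 outputs, 16 taps) the largest index touched is 270, so either x must be padded to 271 samples or the outer loop limited to 241 outputs.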
Thanks Richard -- sharp eyes as usual. We're not actually using such short
lengths... I probably shouldn't have sent that source. I just sent another one,
realistic for our application (attached to reply to Andrew). Can you see it?
-Jeff
> ---------- Original Message -----------
> From: "Jeff Brower"
> To: "Laurent Gauthier"
> Cc: c...
> Sent: Wed, 6 Oct 2010 22:11:21 -0500 (CDT)
> Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
> [1 Attachment]
>
>> Laurent-
>>
>> > It really seems that for the case you are talking about the compiler
>> > intrinsics are the way to go.
>>
>> Yes, or even a function call to the benchmark routine (written in hand-
>> optimized asm lang). However, we are under a project constraint to
>> only use standard C code.
>>
>> > If you want/need more help would you mind sharing at least the code for
>> > one of the loops that you "think" cannot get (using intrinsics) to the
>> > performance level you expected?
>>
>> I've attached the source that we're using to test compiler
>> optimization. It looks like the compiler can generate with a cycle
>> count of about:
>>
>> nx * nh * 9
>>
>> and the hand-written benchmark about:
>>
>> nx * nh / 8
>>
>> where nx is data length and nh is filter length. That's a big
>> difference, so we're trying to determine if the compiler can get closer.
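To put the two estimates side by side, here is a small worked calculation. The function names are hypothetical; the formulas are the two rough cycle counts quoted above (nx * nh * 9 for the compiler output and nx * nh / 8 for the hand-written benchmark), and the example sizes nx = 256 and nh = 16 are the short lengths mentioned earlier in the thread.

```c
#include <assert.h>

/* Rough cycle estimate quoted for the compiler-generated loop. */
unsigned long compiled_cycles(unsigned long nx, unsigned long nh)
{
    return nx * nh * 9UL;
}

/* Rough cycle estimate quoted for the hand-written benchmark. */
unsigned long benchmark_cycles(unsigned long nx, unsigned long nh)
{
    return nx * nh / 8UL;
}
```

For nx = 256 and nh = 16 this gives 36864 cycles against 512, and because both formulas scale with nx * nh the ratio is 72 for any sizes where the division by 8 is exact.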
>>
>> One note about the source: using "negative" x[] indexing produces a
>> markedly slower result, so we're currently assuming that h[] is stored
>> in reverse order and we would zero pad at end of x[].
>>
>> -Jeff
>>
>> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>> >
>> >>
>> >>
>> >> Andrew-
>> >>
>> >>
>> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> >> long long _ddotpl2(long long
>> >> src1_o:src1_e, uint src2);
>> >>
>> >> But it is not something that I have tried.
>> >>
>> >> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> >> unable to reach the same level of performance as fir_r8, which is TI's
>> >> C64x+ benchmark routine for convolution. For one core, cycles for fir_r8
>> >> is on the order of:
>> >>
>> >> nh * nx / 8
>> >>
>> >> where nh is filter length and nx is data length. That appears to be
>> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> >> instructions in parallel.
>> >>
>> >> -Jeff
>> >>
>> >>
>> >> ------------------------------
>> >> *From:* Jeff Brower
>> >> *To:* c...
>> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>> >>
>> >> All-
>> >>
>> >> We have been unable to find a combination of C source code and compiler
>> >> options that will cause the TI C64x+ compiler to generate a DDOTPL2
>> >> (multiply-and-accumulate) instruction. I find that surprising since
>> >> super-efficient MAC has been a TI staple for many years.
>> >>
>> >> Does anyone (in particular TI persons monitoring this group) know whether
>> >> there is a way?
>> >>
>> >> Also, is there an app note about writing optimized C source code newer
>> >> than this one:
>> >>
>> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>> >>
>> >> Thanks.
>> >>
>> >> -Jeff
>> >>
>> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>> >>
>> >> --
>> > Laurent Gauthier
>> >
>> > "They that can give up essential liberty to obtain a little temporary
>> > safety deserve neither liberty nor safety."
>> > --Benjamin Franklin, 1759
>> >
> ------- End of Original Message -------
_____________________________________
Reply by Jeff Brower●October 7, 2010
Bhooshan-
> Since this is a newer instruction targeting C64x+ cores, you should also
> refer 64x to 64x+ software migration documents like these for clues on how
> and whether you can get the compiler to generate code using DDOTPL2.
>
> http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf
>
> Apart from this I believe TI also used to employ register TDM on their
> benchmarks as opposed to loop-unrolling done by the compiler. My
> understanding of the compiler of those days was that the behaviour was not
> yet available thru the C compiler.
>
> It's possible TI may release an under-development compiler to you with these
> extensions, if they exist at all. Worth a shot.
Thanks Bhooshan, very good advice.
-Jeff
> On Thu, Oct 7, 2010 at 4:36 PM, Bhooshan Iyer wrote:
>
>> Jeff--
>> I suggest you write to JS @ TI to see if he has a better compiler oriented
>> approach for you. I know of certain cases where the hand-optimization cannot
>> be matched by the compiler but for most typical loops, compiler optimization
>> should take you fairly close to where you want to go. JS has always been a
>> big champion of letting the compiler work for you.
>>
>> http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf
>>
>>
>> http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf
>>
>> https://www.cosic.esat.kuleuven.be/publications/article-674.pdf
>>
>> --Bhooshan
>>
>> On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:
>>
>>>
>>> [Attachment(s) <#12b865f7e93f0d04_12b8497f2fbc313f_TopText> from Jeff
>>> Brower included below]
>>>
>>> Laurent-
>>>
>>>
>>> > It really seems that for the case you are talking about the compiler
>>> > intrinsics are the way to go.
>>>
>>> Yes, or even a function call to the benchmark routine (written in
>>> hand-optimized asm lang). However, we are under a
>>> project constraint to only use standard C code.
>>>
>>>
>>> > If you want/need more help would you mind sharing at least the code for
>>> one
>>> > of the loops that you "think" cannot get (using intrinsics) to the
>>> > performance level you expected?
>>>
>>> I've attached the source that we're using to test compiler
optimization.
>>> It looks like the compiler can generate with
>>> a cycle count of about:
>>>
>>> nx * nh * 9
>>>
>>> and the hand-written benchmark about:
>>>
>>> nx * nh / 8
>>>
>>> where nx is data length and nh is filter length. That's a big
difference,
>>> so we're trying to determine if the
>>> compiler can get closer.
>>>
>>> One note about the source: using "negative" x[] indexing produces a
>>> markedly slower result, so we're currently
>>> assuming that h[] is stored in reverse order and we would zero pad at end
>>> of x[].
>>>
>>> -Jeff
>>>
>>>
>>> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower
>
>>> wrote:
>>> >
>>> >>
>>> >>
>>> >> Andrew-
>>> >>
>>> >>
>>> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>>> >> long long _ddotpl2(long long
>>> >> src1_o:src1_e, uint src2);
>>> >>
>>> >> But it is not something that I have tried.
>>> >>
>>> >> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>>> >> unable to reach the same level of performance as fir_r8, which is TI's
>>> >> C64x+ benchmark routine for convolution. For one core, cycles for fir_r8
>>> >> is on
>>> >> the order of:
>>> >>
>>> >> nh * nx / 8
>>> >>
>>> >> where nh is filter length and nx is data length. That appears to be
>>> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>>> >> instructions in parallel.
>>> >>
>>> >> -Jeff
>>> >>
>>> >>
>>> >> ------------------------------
>>> >> *From:* Jeff Brower
>>> >> *To:* c...
>>> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
>>> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>> >>
>>> >> All-
>>> >>
>>> >> We have been unable to find a combination of C source code and compiler
>>> >> options that will cause the TI C64x+ compiler
>>> >> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>>> >> surprising since super-efficient MAC has
>>> >> been a TI staple for many years.
>>> >>
>>> >> Does anyone (in particular TI persons monitoring this group) know whether
>>> >> there is a way?
>>> >>
>>> >> Also, is there an app note about writing optimized C source code newer than
>>> >> this one:
>>> >>
>>> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>> >>
>>> >> Thanks.
>>> >>
>>> >> -Jeff
>>> >>
>>> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>> >>
>>> >> --
>>>
>
_____________________________________
Reply by Andrew Elder●October 7, 20102010-10-07
Hi Richard,
My understanding is that there is no reason to *ever* declare a variable
"register" when writing C6000 code. The compiler is smart enough to assign the
variable to a register.
Can someone please jump in and correct me if I am wrong?
- Andrew
________________________________
From: Richard Williams
To: Jeff Brower ; Laurent Gauthier
Cc: c...
Sent: Thu, October 7, 2010 9:10:38 AM
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
Jeff, Gauthier,
The line:
sum += h[j] * x[i + j];
contains x[i + j], where the max size of the x array is 256,
i ranges from 0 to 255,
and j ranges from 0 to 15.
So (as an example) when i=255 and j=1,
the referenced address is beyond the end of the x array bounds.
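A minimal sketch of the bounds problem described above, assuming the sizes quoted in this message (x of 256 samples, h of 16 taps). The names X_LEN, H_LEN, X_PADDED_LEN, max_x_index, pad_input and conv_padded are hypothetical and are not taken from conv_cim.c. It zero-pads the end of x[], which is the approach Jeff mentions in the quoted text, so that x[i + j] never reads past the buffer:

```c
#include <assert.h>
#include <string.h>

#define X_LEN 256                          /* data length quoted above   */
#define H_LEN 16                           /* filter length quoted above */
#define X_PADDED_LEN (X_LEN + H_LEN - 1)   /* room for the last window   */

/* Largest index touched by x[i + j] for i < x_len and j < h_len. */
int max_x_index(int x_len, int h_len)
{
    return (x_len - 1) + (h_len - 1);
}

/* Copy x into a longer buffer and zero the h_len - 1 trailing samples. */
void pad_input(const short *x, short *x_padded, int x_len, int h_len)
{
    memcpy(x_padded, x, (size_t)x_len * sizeof(short));
    memset(x_padded + x_len, 0, (size_t)(h_len - 1) * sizeof(short));
}

/* Same loop shape as sum += h[j] * x[i + j], but reading the padded buffer. */
void conv_padded(const short *x_padded, const short *h, short *y,
                 int x_len, int h_len)
{
    int i, j;

    assert(x_len > 0 && h_len > 0);

    for (i = 0; i < x_len; i++) {
        int sum = 0;
        for (j = 0; j < h_len; j++) {
            sum += h[j] * x_padded[i + j];  /* i + j <= x_len + h_len - 2 */
        }
        y[i] = (short)(sum >> 15);          /* Q15 scaling, as in the thread */
    }
}
```

With X_LEN = 256 and H_LEN = 16, the largest index read is 255 + 15 = 270, so the padded buffer needs 271 elements rather than 256.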
Regarding the optimization...
My first action would be to declare h_len and x_len as 'register' so no CPU
cycles are wasted accessing the values h_len and x_len on the stack.
My second action would be to declare i and j as 'register' so no CPU cycles are
wasted accessing the values on the stack.
R. Williams
---------- Original Message -----------
From: "Jeff Brower"
To: "Laurent Gauthier"
Cc: c...
Sent: Wed, 6 Oct 2010 22:11:21 -0500 (CDT)
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction [1 Attachment]
> Laurent-
>
> > It really seems that for the case you are talking about the compiler
> > intrinsics are the way to go.
>
> Yes, or even a function call to the benchmark routine (written in hand-
> optimized asm lang). However, we are under a project constraint to
> only use standard C code.
>
> > If you want/need more help would you mind sharing at least the code for one
> > of the loops that you "think" cannot get (using intrinsics) to the
> > performance level you expected?
>
> I've attached the source that we're using to test compiler
> optimization. It looks like the compiler can generate with a cycle
> count of about:
>
> nx * nh * 9
>
> and the hand-written benchmark about:
>
> nx * nh / 8
>
> where nx is data length and nh is filter length. That's a big
> difference, so we're trying to determine if the compiler can get
> closer.
>
> One note about the source: using "negative" x[] indexing produces a
> markedly slower result, so we're currently assuming that h[] is stored
> in reverse order and we would zero pad at end of x[].
>
> -Jeff
>
> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
> >
> >>
> >>
> >> Andrew-
> >>
> >>
> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
> >> long long _ddotpl2(long long
> >> src1_o:src1_e, uint src2);
> >>
> >> But it is not something that I have tried.
> >>
> >> Thanks Andrew. I've seen that... but I think with intrinsics we're still
> >> unable to reach the same level of performance as fir_r8, which is TI's
> >> C64x+ benchmark routine for convolution. For one core, cycles for fir_r8
> >> is on
> >> the order of:
> >>
> >> nh * nx / 8
> >>
> >> where nh is filter length and nx is data length. That appears to be
> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
> >> instructions in parallel.
> >>
> >> -Jeff
> >>
> >>
> >> ------------------------------
> >> *From:* Jeff Brower
> >> *To:* c...
> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
> >>
> >> All-
> >>
> >> We have been unable to find a combination of C source code and compiler
> >> options that will cause the TI C64x+ compiler
> >> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
> >> surprising since super-efficient MAC has
> >> been a TI staple for many years.
> >>
> >> Does anyone (in particular TI persons monitoring this group) know whether
> >> there is a way?
> >>
> >> Also, is there an app note about writing optimized C source code newer than
> >> this one:
> >>
> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
> >>
> >> Thanks.
> >>
> >> -Jeff
> >>
> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
> >>
> >> --
> > Laurent Gauthier
> >
> > "They that can give up essential liberty to obtain a little temporary safety
> > deserve neither liberty nor safety."
> > --Benjamin Franklin, 1759
> > ------- End of Original Message -------
Reply by "Sankaran, Jagadeesh"●October 7, 20102010-10-07
The compiler requires the use of intrinsics to match the C64x+ _ddotpl2
instruction; please refer to the c6x.h include file for the expected prototype.
I have been corresponding with Jeff directly. Starting from natural C code,
which is their project's requirement, the C code can be annotated with
_nasserts to give the compiler the information it needs to unroll the loop,
for example as follows:
/*-------------------------------*/
/* The following assumptions are made if NOASSUME is not defined */
/* It is assumed that the number of output samples >= 2. It is also */
/* assumed that the number of output samples to be computed is a */
/* multiple of 2. In addition it is assumed, that the number of */
/* filter taps is >= 4, and a multiple of 4. */
/*-------------------------------*/
/*----------------------------*/
/* Initialize accumulator for FIR sum to be zero. */
/*----------------------------*/
sum = 0;
/*----------------------------*/
/* The following assumptions are made if noassume is defined. */
/* It is assumed that the input, filter and output pointers */
/* are dword aligned. In addition it is assumed that the # */
/* of filter taps is at least 8, or a multiple of 8. */
/*----------------------------*/
/*---------------------------*/
/* Compute FIR as sum of products. */
/*---------------------------*/
for (i = 0; i < nh; i++)
{
    sum += x[i + j] * h[i];
}
/*---------------------------*/
/* Shift out FIR sum and store out. */
/*---------------------------*/
r[j] = sum >> 15;
}
}
When we compile with cl6x -k -o3 -mwth -mv6400 or -mv6400+ we will see 16
DOTP2's in 9 cycles. Each DOTP2 performs two multiplies, so this achieves
32 multiplies in 9 cycles; that is, about 3.56 multiplies per cycle is
achievable by annotating natural C code.
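For readers who want something compilable, here is a minimal sketch of what such an annotated natural C FIR might look like. It is not the TI source: the function name fir_annotated, the argument order, and the particular assumptions stated in the _nassert() and MUST_ITERATE lines are illustrative guesses modeled on the comments above. The TI-specific annotations are guarded by __TI_COMPILER_VERSION__ (assumed to be predefined by cl6x) so the same file also builds with a host compiler. x[] must hold at least nr + nh - 1 samples.

```c
#include <assert.h>

#ifndef __TI_COMPILER_VERSION__
/* Host build: _nassert() is a TI intrinsic, so make it disappear. */
#define _nassert(cond) ((void)0)
#endif

void fir_annotated(const short *restrict x, const short *restrict h,
                   short *restrict r, int nh, int nr)
{
    int i, j, sum;

    /* Plain C check of the assumptions the hints below rely on. */
    assert(nh >= 8 && nh % 8 == 0 && nr >= 2 && nr % 2 == 0);

    /* Assumed: all three pointers are double-word (8-byte) aligned. */
    _nassert((int)x % 8 == 0);
    _nassert((int)h % 8 == 0);
    _nassert((int)r % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(2, , 2)   /* at least 2 outputs, a multiple of 2 */
#endif
    for (j = 0; j < nr; j++) {
        sum = 0;               /* accumulator for this output sample */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, , 8)   /* at least 8 taps, a multiple of 8 */
#endif
        for (i = 0; i < nh; i++) {
            sum += x[i + j] * h[i];
        }
        r[j] = (short)(sum >> 15);   /* shift out the Q15 FIR sum */
    }
}
```

The loop body itself stays ordinary C; only the hints are compiler-specific, which is why they are kept behind the preprocessor here.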
________________________________
From: c... [mailto:c...] On Behalf Of Bhooshan Iyer
Sent: Thursday, October 07, 2010 6:45 AM
To: Jeff Brower
Cc: Laurent Gauthier; c...
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction
Since this is a newer instruction targeting C64x+ cores, you should also refer
to 64x to 64x+ software migration documents like these for clues on how and
whether you can get the compiler to generate code using DDOTPL2.
Apart from this I believe TI also used to employ register TDM on their
benchmarks as opposed to loop-unrolling done by the compiler. My understanding
of the compiler of those days was that the behaviour was not yet available
through the C compiler.
It's possible TI may release an under-development compiler to you with these
extensions, if they exist at all. Worth a shot.
--Bhooshan
On Thu, Oct 7, 2010 at 4:36 PM, Bhooshan Iyer wrote:
Jeff--
I suggest you write to JS @ TI to see if he has a better compiler oriented
approach for you. I know of certain cases where the hand-optimization cannot be
matched by the compiler but for most typical loops, compiler optimization should
take you fairly close to where you want to go. JS has always been a big champion
of letting the compiler work for you.
--Bhooshan
On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:
[Attachment(s) from Jeff Brower included below]
Laurent-
> It really seems that for the case you are talking about the compiler
> intrinsics are the way to go.
Yes, or even a function call to the benchmark routine (written in
hand-optimized asm lang). However, we are under a
project constraint to only use standard C code.
> If you want/need more help would you mind sharing at least the code for one
> of the loops that you "think" cannot get (using intrinsics) to the
> performance level you expected?
I've attached the source that we're using to test compiler
optimization. It looks like the compiler can generate with
a cycle count of about:
nx * nh * 9
and the hand-written benchmark about:
nx * nh / 8
where nx is data length and nh is filter length. That's a big difference,
so we're trying to determine if the
compiler can get closer.
One note about the source: using "negative" x[] indexing produces a markedly
slower result, so we're currently
assuming that h[] is stored in reverse order and we would zero pad at end of
x[].
-Jeff
> On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>
>> Andrew-
>> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> long long _ddotpl2(long long
>> src1_o:src1_e, uint src2);
>>
>> But it is not something that I have tried.
>>
>> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> the order of:
>>
>> nh * nx / 8
>>
>> where nh is filter length and nx is data length. That appears to be
>> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> instructions in parallel.
>>
>> -Jeff
>> ------------------------------
>> *From:* Jeff Brower
>> *To:* c...
>> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>
>> All-
>>
>> We have been unable to find a combination of C source code and compiler
>> options that will cause the TI C64x+ compiler
>> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> surprising since super-efficient MAC has
>> been a TI staple for many years.
>>
>> Does anyone (in particular TI persons monitoring this group) know whether
>> there is a way?
>>
>> Also, is there an app note about writing optimized C source code newer than
>> this one:
>>
>> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>
>> Thanks.
>>
>> -Jeff
>>
>> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>
>> --
Reply by Andrew Elder●October 7, 20102010-10-07
Jeff,
Are you able to make any assumptions about h_len, x_len and array alignment in
memory?
TI advises that you use
#pragma MUST_ITERATE
to support compiler loop unrolling
and
_nassert(((int)x & 0x7) == 0);
to tell the compiler that x is double-word (8-byte) aligned, so it can use
LDDW instructions.
Of course this is no longer standard C!
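One way to keep such hints from breaking a standard C build is to hide them behind the preprocessor. The sketch below is only an illustration: ALIGNED8_HINT, is_aligned8 and dot16 are made-up names, and it assumes that the TI compiler predefines __TI_COMPILER_VERSION__ and that LDDW needs double-word (8-byte) alignment, hence the 0x7 mask.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
/* TI build: tell the optimizer the pointer is 8-byte aligned. */
#define ALIGNED8_HINT(p) _nassert(((int)(p) & 0x7) == 0)
#else
/* Any other compiler: the hint expands to nothing. */
#define ALIGNED8_HINT(p) ((void)0)
#endif

/* Portable run-time check of the same property. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 0x7u) == 0;
}

/* 16-bit dot product; n is assumed to be a nonzero multiple of 4. */
int dot16(const short *a, const short *b, int n)
{
    int k, sum = 0;

    assert(n >= 4 && n % 4 == 0);
    ALIGNED8_HINT(a);
    ALIGNED8_HINT(b);

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(4, , 4)   /* trip count >= 4 and a multiple of 4 */
#endif
    for (k = 0; k < n; k++) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

The pragma is guarded as well, although a conforming C compiler is expected to ignore pragmas it does not recognize. Whether this satisfies a "standard C only" project rule is a policy question rather than a technical one.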
- Andrew
________________________________
From: Jeff Brower
To: Laurent Gauthier
Cc: c...
Sent: Wed, October 6, 2010 11:11:21 PM
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction [1 Attachment]
[Attachment(s) from Jeff Brower included below]
Laurent-
> It really seems that for the case you are talking about the compiler
> intrinsics are the way to go.
Yes, or even a function call to the benchmark routine (written in hand-optimized
asm lang). However, we are under a
project constraint to only use standard C code.
> If you want/need more help would you mind sharing at least the code for one
> of the loops that you "think" cannot get (using intrinsics) to the
> performance level you expected?
I've attached the source that we're using to test compiler optimization. It
looks like the compiler can generate with
a cycle count of about:
nx * nh * 9
and the hand-written benchmark about:
nx * nh / 8
where nx is data length and nh is filter length. That's a big difference, so
we're trying to determine if the
compiler can get closer.
One note about the source: using "negative" x[] indexing produces a markedly
slower result, so we're currently
assuming that h[] is stored in reverse order and we would zero pad at end of
x[].
-Jeff
> On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>
>> Andrew-
>> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> long long _ddotpl2(long long
>> src1_o:src1_e, uint src2);
>>
>> But it is not something that I have tried.
>>
>> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> the order of:
>>
>> nh * nx / 8
>>
>> where nh is filter length and nx is data length. That appears to be
>> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> instructions in parallel.
>>
>> -Jeff
>> ------------------------------
>> *From:* Jeff Brower
>> *To:* c...
>> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>
>> All-
>>
>> We have been unable to find a combination of C source code and compiler
>> options that will cause the TI C64x+ compiler
>> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> surprising since super-efficient MAC has
>> been a TI staple for many years.
>>
>> Does anyone (in particular TI persons monitoring this group) know whether
>> there is a way?
>>
>> Also, is there an app note about writing optimized C source code newer than
>> this one:
>>
>> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>
>> Thanks.
>>
>> -Jeff
>>
>> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>
>> --
> Laurent Gauthier
>
> "They that can give up essential liberty to obtain a little temporary safety
> deserve neither liberty nor safety."
> --Benjamin Franklin, 1759
>