DSPRelated.com
Forums

efficient C64x+ code generation and DDOTPL2 instruction

Started by Jeff Brower October 4, 2010
All-

We have been unable to find a combination of C source code and compiler options that will cause the TI C64x+ compiler
to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that surprising since super-efficient MAC has
been a TI staple for many years.

Does anyone (in particular TI persons monitoring this group) know whether there is a way?

Also, is there an app note about writing optimized C source code newer than this one:

http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf

Thanks.

-Jeff

PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.

_____________________________________
Andrew-

> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
> long long _ddotpl2(long long src1_o:src1_e, uint src2);
>
> But it is not something that I have tried.

Thanks Andrew. I've seen that... but I think with intrinsics we're still unable to
reach the same level of performance as fir_r8, which is TI's C64x+ benchmark routine
for convolution. For one core, the cycle count for fir_r8 is on the order of:

nh * nx / 8

where nh is filter length and nx is data length. That appears to be achieved by a
"few" DDOTPL2s in parallel, plus other groups of various instructions in parallel.

-Jeff
_____________________________________
Jeff,

pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
long long _ddotpl2(long long src1_o:src1_e, uint src2);

But it is not something that I have tried.

- Andrew E.
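
For readers who have not met the instruction, here is a portable C model of what DDOTPL2 computes, as far as I understand the C64x+ instruction description: two overlapping 2-tap dot products over three consecutive 16-bit samples, which is the shape a sliding FIR window needs. The names ddot_result, lo16, hi16, and ddotpl2_model are made up for illustration. This is not the TI intrinsic, and the exact operand packing should be checked against the TI instruction set reference before relying on it.

```c
#include <stdint.h>

/* Hypothetical host-side model of DDOTPL2, NOT the TI _ddotpl2 intrinsic. */
typedef struct {
    int32_t even;  /* models the dst_e half of the 64-bit result */
    int32_t odd;   /* models the dst_o half of the 64-bit result */
} ddot_result;

/* Extract packed signed halfwords. The narrowing cast relies on the usual
 * two's-complement behaviour of common compilers. */
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Assumed packing: src1_e = x1:x0, src1_o = (unused):x2, src2 = h1:h0.
 *   even = x0*h0 + x1*h1
 *   odd  = x1*h0 + x2*h1
 * Overflow in the extreme all-(-32768) case is not modelled here. */
static ddot_result ddotpl2_model(uint32_t src1_o, uint32_t src1_e, uint32_t src2)
{
    ddot_result r;
    r.even = (int32_t)lo16(src1_e) * lo16(src2)
           + (int32_t)hi16(src1_e) * hi16(src2);
    r.odd  = (int32_t)hi16(src1_e) * lo16(src2)
           + (int32_t)lo16(src1_o) * hi16(src2);
    return r;
}
```

With samples x0..x2 = 1, 2, 3 and taps h0, h1 = 10, 20, the model gives 1*10 + 2*20 for the even result and 2*10 + 3*20 for the odd result, so one call covers two adjacent output positions.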

________________________________
Hi Jeff,

It really seems that for the case you are talking about the compiler
intrinsics are the way to go.

If you want/need more help would you mind sharing at least the code for one
of the loops that you "think" cannot get (using intrinsics) to the
performance level you expected?

Best Regards, Laurent.

Laurent Gauthier

"They that can give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety."
--Benjamin Franklin, 1759

_____________________________________
Jeff--
I suggest you write to JS @ TI to see if he has a better compiler-oriented
approach for you. I know of certain cases where hand optimization cannot
be matched by the compiler, but for most typical loops compiler optimization
should take you fairly close to where you want to go. JS has always been a
big champion of letting the compiler work for you.

http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf


http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf

https://www.cosic.esat.kuleuven.be/publications/article-674.pdf

--Bhooshan

On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:

> Laurent-
>
> > It really seems that for the case you are talking about the compiler
> > intrinsics are the way to go.
>
> Yes, or even a function call to the benchmark routine (written in
> hand-optimized assembly language). However, we are under a project
> constraint to use only standard C code.
>
> > If you want/need more help would you mind sharing at least the code for
> > one of the loops that you "think" cannot get (using intrinsics) to the
> > performance level you expected?
>
> I've attached the source that we're using to test compiler optimization.
> It looks like the compiler can generate code with a cycle count of about:
>
> nx * nh * 9
>
> and the hand-written benchmark about:
>
> nx * nh / 8
>
> where nx is data length and nh is filter length. That's a big difference,
> so we're trying to determine if the compiler can get closer.
>
> One note about the source: using "negative" x[] indexing produces a
> markedly slower result, so we're currently assuming that h[] is stored in
> reverse order and that we zero pad at the end of x[].
>
> -Jeff
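
To put the two estimates side by side, here is a small C sketch that evaluates both formulas. The function names are made up for illustration, and the example sizes nx = 256 and nh = 16 are assumptions taken from the array bounds discussed elsewhere in this thread. As far as I know, the divisor of 8 is consistent with a C64x+ core issuing up to eight 16x16 multiplies per cycle.

```c
/* Rough cycle estimates quoted in the post above. The constants 9 and 8
 * come from the post; the function names are hypothetical. */
static unsigned long cycles_compiled(unsigned long nx, unsigned long nh)
{
    return nx * nh * 9UL;   /* compiler-generated code: about nx * nh * 9 */
}

static unsigned long cycles_fir_r8(unsigned long nx, unsigned long nh)
{
    return nx * nh / 8UL;   /* hand-written benchmark: about nx * nh / 8 */
}
```

For nx = 256 and nh = 16 these give 36864 and 512 cycles respectively, a ratio of 72.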

--
-----------------------
"I've missed more than 9000 shots in my career.
I've lost almost 300 games. 26 times I've been trusted to take the game
winning shot and missed.
I've failed over and over again in my life.
And that is why I succeed."
-- Michael Jordan
-----------------------

_____________________________________
Since this is a newer instruction targeting C64x+ cores, you should also
refer to C64x-to-C64x+ software migration documents like this one for clues on how
and whether you can get the compiler to generate code using DDOTPL2.

http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf

Apart from this, I believe TI also used to employ register TDM on their
benchmarks as opposed to the loop unrolling done by the compiler. My
understanding of the compiler of those days was that this behaviour was not
yet available through the C compiler.

It's possible TI may release an under-development compiler to you with these
extensions, if they exist at all. Worth a shot.

--Bhooshan

_____________________________________
Jeff, Gauthier,

The line:
sum += h[j] * x[i + j];
contains x[i + j], where the maximum size of the x array is 256,
i ranges from 0 to 255,
and j ranges from 0 to 15.

So (as an example) when i=255 and j=1,
the referenced address is beyond the bounds of the x array.

regarding the optimization...

my first action would be to declare h_len and x_len as 'register' so no CPU
cycles are wasted accessing the values h_len and x_len on the stack.

my second action would be to declare i and j as 'register' so no CPU cycles are
wasted accessing the values on the stack.

R. Williams
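
A minimal sketch of the loop in question, assuming the 256-sample input and 16-tap filter sizes given above, 16-bit data, and the zero padding at the end of x[] that Jeff describes. Padding x[] with H_LEN - 1 zeros keeps x[i + j] in bounds for every i and j. The names fir_padded, X_LEN, H_LEN, and X_PADDED are hypothetical, since the attached source is not shown in the thread.

```c
#include <stdint.h>

#define X_LEN    256                  /* number of input samples (assumed)   */
#define H_LEN    16                   /* number of filter taps (assumed)     */
#define X_PADDED (X_LEN + H_LEN - 1)  /* input plus zero padding at the end  */

/* x must hold X_PADDED elements, with x[X_LEN] .. x[X_PADDED - 1] set to 0.
 * h is assumed to be stored in reverse order, so no negative indexing is
 * needed. y receives X_LEN output samples. */
void fir_padded(const int16_t *x, const int16_t *h, int32_t *y)
{
    int i, j;

    for (i = 0; i < X_LEN; i++) {
        int32_t sum = 0;
        for (j = 0; j < H_LEN; j++) {
            /* Largest index read is 255 + 15 = 270, inside the padded array. */
            sum += (int32_t)h[j] * x[i + j];
        }
        y[i] = sum;
    }
}
```

Without the padding, the same loop reads past the end of a 256-element array as soon as i + j reaches 256.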


_____________________________________
Bhooshan-

> I suggest you write to JS @ TI to see if he has a better compiler oriented
> approach for you.

Yep, already did that and got some very helpful advice.

-Jeff


_____________________________________
Jeff,

Are you able to make any assumptions about h_len, x_len and array alignment in
memory?

TI advises that you use
#pragma MUST_ITERATE
to support compiler loop unrolling
and
_nassert(((int)x & 0x7) == 0);
to tell the compiler that x is aligned to an 8-byte boundary, so it can use LDDW instructions.

Of course this is no longer standard C!

- Andrew
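
As a sketch of where those hints might go, assuming 16-bit data, a filter length that is at least 8 and a multiple of 8, and arrays aligned to 8 bytes (the alignment that LDDW double-word loads need). These are assumptions, not facts from the attached source, and the function name fir_hinted is made up. The restrict qualifiers are an extra C99 assumption that x, h, and y do not overlap. The TI-specific hints are guarded with __TI_COMPILER_VERSION__ so the same file still builds as plain C elsewhere.

```c
#include <stdint.h>

/* x must hold nx + nh - 1 samples (zero padded at the end); h holds nh taps
 * in reverse order; y receives nx outputs. */
void fir_hinted(const int16_t *restrict x, const int16_t *restrict h,
                int32_t *restrict y, int nx, int nh)
{
    int i, j;

#ifdef __TI_COMPILER_VERSION__
    /* Promise the TI compiler that both arrays start on 8-byte boundaries. */
    _nassert(((int)x & 0x7) == 0);
    _nassert(((int)h & 0x7) == 0);
#endif

    for (i = 0; i < nx; i++) {
        int32_t sum = 0;
#ifdef __TI_COMPILER_VERSION__
        /* Inner loop runs at least 8 times and always a multiple of 8. */
#pragma MUST_ITERATE(8, , 8)
#endif
        for (j = 0; j < nh; j++) {
            sum += (int32_t)h[j] * x[i + j];
        }
        y[i] = sum;
    }
}
```

Whether these hints are enough to make the compiler pick DDOTPL2 is not something this sketch can promise; it only shows the placement of the hints.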
