All-
We have been unable to find a combination of C source code and compiler options
that will cause the TI C64x+ compiler to generate a DDOTPL2
(multiply-and-accumulate) instruction. I find that surprising since
super-efficient MAC has been a TI staple for many years.
Does anyone (in particular TI persons monitoring this group) know whether there
is a way?
Also, is there an app note about writing optimized C source code newer than this
one:
http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
Thanks.
-Jeff
PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
_____________________________________
efficient C64x+ code generation and DDOTPL2 instruction
Started by ●October 4, 2010
Reply by ●October 4, 2010
Andrew-
> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
> long long _ddotpl2(long long src1_o:src1_e, uint src2);
>
> But it is not something that I have tried.
>
Thanks Andrew. I've seen that... but I think with intrinsics we're still unable to
reach the same level of performance as fir_r8, which is TI's C64x+ benchmark routine
for convolution. For one core, the cycle count for fir_r8 is on the order of:
nh * nx / 8
where nh is filter length and nx is data length. That appears to be achieved by a
"few" DDOTPL2s in parallel, plus other groups of various instructions in parallel.
-Jeff
Reply by ●October 5, 2010
Jeff,
pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
long long _ddotpl2(long long src1_o:src1_e, uint src2);
But it is not something that I have tried.
- Andrew E.
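For readers without the TI toolchain, here is a rough portable sketch of the kind of operation the intrinsic above exposes: two overlapping two-tap dot products on packed signed 16-bit values. The names `dot_pair`, `pack16`, `ddotpl2_model` and `two_outputs` are hypothetical, and the lane assignment (which halfwords feed the even and odd results) is an assumption that should be checked against TI's instruction set reference before relying on it. Results are held in 64-bit integers only to sidestep overflow in this model.

```c
#include <assert.h>
#include <stdint.h>

/* Two dot products produced by one call (hypothetical type for this sketch). */
typedef struct {
    int64_t even; /* assumed counterpart of the even destination register */
    int64_t odd;  /* assumed counterpart of the odd destination register */
} dot_pair;

/* Pack two signed 16-bit values into one 32-bit word. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Sign-extend the low 16 bits of a 32-bit word. */
int64_t lo16(uint32_t w)
{
    int64_t v = (int64_t)(w & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Sign-extend the high 16 bits of a 32-bit word. */
int64_t hi16(uint32_t w)
{
    return lo16(w >> 16);
}

/* Portable model of two overlapping two-tap dot products.
 * The lane assignment is an assumption, not taken from the thread. */
dot_pair ddotpl2_model(uint32_t src1_o, uint32_t src1_e, uint32_t src2)
{
    dot_pair r;
    r.even = lo16(src1_e) * lo16(src2) + hi16(src1_e) * hi16(src2);
    r.odd  = hi16(src1_e) * lo16(src2) + lo16(src1_o) * hi16(src2);
    return r;
}

/* Two adjacent FIR outputs from three samples and two coefficients. */
void two_outputs(const int16_t x[3], const int16_t h[2], int64_t y[2])
{
    assert(x && h && y);
    dot_pair r = ddotpl2_model(pack16(0, x[2]), pack16(x[1], x[0]),
                               pack16(h[1], h[0]));
    y[0] = r.even; /* h[0]*x[0] + h[1]*x[1] */
    y[1] = r.odd;  /* h[0]*x[1] + h[1]*x[2] */
}
```

Under these assumptions, one call yields two neighbouring convolution outputs, which is presumably why a few such instructions in parallel matter for a FIR inner loop. On the target, the intrinsic would take the place of `ddotpl2_model`.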
Reply by ●October 5, 2010
Hi Jeff,
It really seems that for the case you are talking about the compiler
intrinsics are the way to go.
If you want/need more help would you mind sharing at least the code for one
of the loops that you "think" cannot get (using intrinsics) to the
performance level you expected?
Best Regards, Laurent.
Laurent Gauthier
"They that can give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety."
--Benjamin Franklin, 1759
Reply by ●October 6, 2010
Reply by ●October 7, 2010
Jeff--
I suggest you write to JS @ TI to see if he has a better compiler oriented
approach for you. I know of certain cases where the hand-optimization cannot
be matched by the compiler but for most typical loops, compiler optimization
should take you fairly close to where you want to go. JS has always been a
big champion of letting the compiler work for you.
http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf
http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf
https://www.cosic.esat.kuleuven.be/publications/article-674.pdf
--Bhooshan
On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:
>
> Laurent-
>
> > It really seems that for the case you are talking about the compiler
> > intrinsics are the way to go.
>
> Yes, or even a function call to the benchmark routine (written in
> hand-optimized assembly language). However, we are under a project
> constraint to use only standard C code.
>
> > If you want/need more help would you mind sharing at least the code for
> > one of the loops that you "think" cannot get (using intrinsics) to the
> > performance level you expected?
>
> I've attached the source that we're using to test compiler optimization.
> It looks like the compiler can generate code with a cycle count of about:
>
> nx * nh * 9
>
> and the hand-written benchmark about:
>
> nx * nh / 8
>
> where nx is data length and nh is filter length. That's a big difference,
> so we're trying to determine if the compiler can get closer.
>
> One note about the source: using "negative" x[] indexing produces a
> markedly slower result, so we're currently assuming that h[] is stored
> in reverse order and that we would zero-pad at the end of x[].
>
> -Jeff
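To put the two estimates quoted above side by side, here is a small sketch that evaluates both. The function names are hypothetical, and the sizes nx = 256 and nh = 16 used in the note below are illustrative values chosen for the example, not measurements.

```c
#include <assert.h>

/* Estimated cycles for the compiler-generated loop: about nx * nh * 9. */
unsigned long compiled_cycles(unsigned long nx, unsigned long nh)
{
    return nx * nh * 9UL;
}

/* Estimated cycles for the hand-written benchmark: about nx * nh / 8. */
unsigned long benchmark_cycles(unsigned long nx, unsigned long nh)
{
    return nx * nh / 8UL;
}

/* Ratio of the two estimates. The nx * nh factor cancels, leaving
 * 9 * 8 = 72 whenever nx * nh is a multiple of 8. */
unsigned long estimate_ratio(unsigned long nx, unsigned long nh)
{
    unsigned long fast = benchmark_cycles(nx, nh);
    assert(fast > 0); /* requires nx * nh >= 8 */
    return compiled_cycles(nx, nh) / fast;
}
```

For nx = 256 and nh = 16 this gives 36864 cycles against 512, so the gap under discussion is a factor of about 72, independent of the data and filter lengths.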
--
-----------------------
"I've missed more than 9000 shots in my career.
I've lost almost 300 games. 26 times I've been trusted to take the game
winning shot and missed.
I've failed over and over again in my life.
And that is why I succeed."
-- Michael Jordan
-----------------------
Reply by ●October 7, 2010
Since this is a newer instruction targeting C64x+ cores, you should also
refer to 64x to 64x+ software migration documents like this one for clues on
how and whether you can get the compiler to generate code using DDOTPL2.
http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf
Apart from this, I believe TI also used to employ register TDM on their
benchmarks as opposed to the loop unrolling done by the compiler. My
understanding of the compiler of those days was that this behaviour was not
yet available through the C compiler.
It's possible TI may release an under-development compiler to you with these
extensions, if they exist at all. Worth a shot.
--Bhooshan
Reply by ●October 7, 2010
Jeff, Gauthier,
The line:
sum += h[j] * x[i + j];
contains x[i + j], where the max size of the x array is 256,
i ranges from 0 to 255,
and j ranges from 0 to 15.
So (as an example) when i=255 and j=1,
the referenced address is beyond the end of the x array bounds.
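One way to keep x[i + j] in bounds is to limit the outer loop to the outputs that have a full set of taps. The sketch below is not the attached test source; `fir_valid` and its parameter names are hypothetical, it assumes C99 (`restrict`, loop-scoped declarations), and it assumes h[] is stored in reverse order so that indexing runs forward.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C FIR with h[] stored in reverse order.
 * Writes x_len - h_len + 1 outputs, so x[i + j] never passes x[x_len - 1].
 * Returns the number of outputs written to y[]. */
size_t fir_valid(const short *restrict x, size_t x_len,
                 const short *restrict h, size_t h_len,
                 int *restrict y)
{
    assert(h_len > 0 && x_len >= h_len);
    size_t n_out = x_len - h_len + 1;
    for (size_t i = 0; i < n_out; i++) {
        int sum = 0;
        for (size_t j = 0; j < h_len; j++) {
            sum += h[j] * x[i + j]; /* largest index reached: x_len - 1 */
        }
        y[i] = sum;
    }
    return n_out;
}
```

With x_len = 256 and h_len = 16 this writes 241 outputs, and the last value of i is 240. The alternative is to keep all 256 outer iterations and make the x buffer at least h_len - 1 = 15 samples longer, filled with zeros.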
Regarding the optimization...
My first action would be to declare h_len and x_len as 'register' so no CPU
cycles are wasted accessing the values h_len and x_len on the stack.
My second action would be to declare i and j as 'register' so no CPU cycles are
wasted accessing the values on the stack.
R. Williams
Reply by ●October 7, 2010
Bhooshan-
> I suggest you write to JS @ TI to see if he has a better compiler oriented
> approach for you.
Yep, already did that and got some very helpful advice.
-Jeff
> I know of certain cases where the hand-optimization cannot
> be matched by the compiler but for most typical loops, compiler optimization
> should take you fairly close to where you want to go. JS has always been a
> big champion of letting the compiler work for you.
>
> http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf
>
>
> http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf
>
> https://www.cosic.esat.kuleuven.be/publications/article-674.pdf
>
> --Bhooshan
>
> On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:
>
>>
>> [Attachment(s) from Jeff Brower included below]
>>
>> Laurent-
>>
>> > It really seems that for the case you are talking about the compiler
>> > intrinsics are the way to go.
>>
>> Yes, or even a function call to the benchmark routine (written in
>> hand-optimized asm lang). However, we are under a
>> project constraint to only use standard C code.
>>
>> > If you want/need more help would you mind sharing at least the code for one
>> > of the loops that you "think" cannot get (using intrinsics) to the
>> > performance level you expected?
>>
>> I've attached the source that we're using to test compiler optimization. It
>> looks like the compiler can generate with
>> a cycle count of about:
>>
>> nx * nh * 9
>>
>> and the hand-written benchmark about:
>>
>> nx * nh / 8
>>
>> where nx is data length and nh is filter length. That's a big difference,
>> so we're trying to determine if the
>> compiler can get closer.
>>
>> One note about the source: using "negative" x[] indexing produces a
>> markedly slower result, so we're currently
>> assuming that h[] is stored in reverse order and we would zero pad at end
>> of x[].
>>
>> -Jeff
>>
>> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>> >
>> >>
>> >>
>> >> Andrew-
>> >>
>> >>
>> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> >> long long _ddotpl2(long long
>> >> src1_o:src1_e, uint src2);
>> >>
>> >> But it is not something that I have tried.
>> >>
>> >> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> >> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> >> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> >> the order of:
>> >>
>> >> nh * nx / 8
>> >>
>> >> where nh is filter length and nx is data length. That appears to be
>> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> >> instructions in parallel.
>> >>
>> >> -Jeff
>> >>
>> >>
>> >> ------------------------------
>> >> *From:* Jeff Brower
>> >> *To:* c...
>> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>> >>
>> >> All-
>> >>
>> >> We have been unable to find a combination of C source code and compiler
>> >> options that will cause the TI C64x+ compiler
>> >> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> >> surprising since super-efficient MAC has
>> >> been a TI staple for many years.
>> >>
>> >> Does anyone (in particular TI persons monitoring this group) know whether
>> >> there is a way?
>> >>
>> >> Also, is there an app note about writing optimized C source code newer than
>> >> this one:
>> >>
>> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>> >>
>> >> Thanks.
>> >>
>> >> -Jeff
>> >>
>> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>> >>
>> >> --
>> > Laurent Gauthier
>> >
>> > "They that can give up essential liberty to obtain a little temporary safety
>> > deserve neither liberty nor safety."
>> > --Benjamin Franklin, 1759
>> >
>>
>> --
> -----------------------
> "I've missed more than 9000 shots in my career.
> I've lost almost 300 games. 26 times I've been trusted to take the game
> winning shot and missed.
> I've failed over and over again in my life.
> And that is why I succeed."
> -- Michael Jordan
> -----------------------
>
_____________________________________
Reply by ●October 7, 2010
Jeff,
Are you able to make any assumptions about h_len, x_len and array alignment in
memory?
TI advises that you use
#pragma MUST_ITERATE
to support compiler loop unrolling
and
_nassert(((int)x & 0x7) == 0);
to tell the compiler it can use LDDW instructions (LDDW needs 8-byte
alignment; a mask of 0x3 only promises word alignment).
Of course this is no longer standard C!
- Andrew
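For what it's worth, here is a rough sketch of how those two hints might sit in a plain C FIR loop. It is not the code attached to this thread: the function name, signature, and the 32-bit unscaled output are made up for illustration. It assumes h[] is stored in reverse order and x[] is zero padded at the end (as Jeff described), that nh is a multiple of 8, and that h is 8-byte aligned. The alignment hint is applied to h only, because x + i is not 8-byte aligned for every i, so the compiler would presumably need non-aligned loads for x. The TI-specific parts are guarded so the same file still builds with a host compiler; restrict is standard C99.

```c
#include <assert.h>
#include <stdint.h>

/* _nassert is a TI compiler intrinsic; make it a no-op elsewhere so the
 * sketch also builds with a host compiler. _TMS320C6X is predefined by
 * the TI C6000 compiler. */
#ifndef _TMS320C6X
#define _nassert(cond) ((void)0)
#endif

/* Hypothetical name and signature, not the attachment from the thread.
 * Computes y[i] = sum over j of x[i + j] * h[j] for i in [0, ny).
 * Assumes: h[] holds the taps in reverse order, x[] has at least
 * ny + nh - 1 samples (zero padded at the end), nh is a multiple of 8,
 * and h is 8-byte aligned when built for the C64x+.
 * No Q15 scaling, rounding, or saturation is done here. */
void fir_sketch(const int16_t *restrict x, const int16_t *restrict h,
                int32_t *restrict y, int nh, int ny)
{
    int i, j;

    /* Check the trip-count promise that MUST_ITERATE makes below. */
    assert(nh >= 8 && nh % 8 == 0);

    /* Promise double-word alignment of the taps so aligned 64-bit
     * loads are legal for h. */
    _nassert(((int)h & 0x7) == 0);

    for (i = 0; i < ny; i++) {
        int32_t sum = 0;

        /* Inner loop runs at least 8 times and always a multiple of 8,
         * which gives the compiler room to unroll. */
#ifdef _TMS320C6X
#pragma MUST_ITERATE(8, , 8)
#endif
        for (j = 0; j < nh; j++)
            sum += (int32_t)x[i + j] * h[j];

        y[i] = sum;
    }
}
```

Whether this gets anywhere near fir_r8 is something only the compiler's software pipelining feedback can tell you; the hints just remove some of the reasons the compiler has to be conservative.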
________________________________
From: Jeff Brower
To: Laurent Gauthier
Cc: c...
Sent: Wed, October 6, 2010 11:11:21 PM
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction [1
Attachment]
[Attachment(s) from Jeff Brower included below]
Laurent-
> It really seems that for the case you are talking about the compiler
> intrinsics are the way to go.
Yes, or even a function call to the benchmark routine (written in hand-optimized
asm lang). However, we are under a
project constraint to only use standard C code.
> If you want/need more help would you mind sharing at least the code for one
> of the loops that you "think" cannot get (using intrinsics) to the
> performance level you expected?
I've attached the source that we're using to test compiler optimization. It
looks like the compiler can generate with
a cycle count of about:
nx * nh * 9
and the hand-written benchmark about:
nx * nh / 8
where nx is data length and nh is filter length. That's a big difference, so
we're trying to determine if the
compiler can get closer.
One note about the source: using "negative" x[] indexing produces a markedly
slower result, so we're currently
assuming that h[] is stored in reverse order and we would zero pad at end of
x[].
-Jeff
> On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>
>> Andrew-
>> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> long long _ddotpl2(long long
>> src1_o:src1_e, uint src2);
>>
>> But it is not something that I have tried.
>>
>> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> the order of:
>>
>> nh * nx / 8
>>
>> where nh is filter length and nx is data length. That appears to be
>> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> instructions in parallel.
>>
>> -Jeff
>> ------------------------------
>> *From:* Jeff Brower
>> *To:* c...
>> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>
>> All-
>>
>> We have been unable to find a combination of C source code and compiler
>> options that will cause the TI C64x+ compiler
>> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> surprising since super-efficient MAC has
>> been a TI staple for many years.
>>
>> Does anyone (in particular TI persons monitoring this group) know whether
>> there is a way?
>>
>> Also, is there an app note about writing optimized C source code newer than
>> this one:
>>
>> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>
>> Thanks.
>>
>> -Jeff
>>
>> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>
>> --
> Laurent Gauthier
>
> "They that can give up essential liberty to obtain a little temporary safety
> deserve neither liberty nor safety."
> --Benjamin Franklin, 1759
>