Reply by Richard Williams October 10, 2010
Jeff,

For the referenced code, I have the following comments.

1) The int variables sig_loop_lower_bound and sig_loop_upper_bound are on the
stack and are never set to any specific value, so no _nassert() tests should
be run against them.
*I* would eliminate these variables and all _nassert() references to them.

2) the statements:
_nassert((int)h % 8 == 0);
_nassert((int)x % 8 == 0);
_nassert((int)y % 8 == 0);
should be moved outside/before the outermost loop, as the addresses of these
arrays do not change and so only need to be checked once.

3) To ensure that the h, x, y arrays are all properly aligned (the _nassert()
tests above assume an 8-byte boundary, not a 32-bit one), a bit of work needs
to be done in the .cmd linker file and the appropriate lines added to the
source, similar to:
#pragma DATA_SECTION(x, "xStartAddressMem")
short int x[X_LEN];
#pragma DATA_SECTION(h, "hStartAddressMem")
short int h[H_LEN];
#pragma DATA_SECTION(y, "yStartAddressMem")
short int y[H_LEN+X_LEN];

4) The index calculation within the inner loop:
(x[i + j])
is highly undesirable and should be moved outside the inner loop,
eliminating 'j',
similar to the following:
short int * pX;
short int * pH;
short int * pHmax = &(h[h_len]);
short int * pY = y;
.
.
.
sum = 0;
for (pH = h, pX = &x[i];   // pointer initialization
     pH < (pHmax);         // pointer comparison
     pH++, pX++)           // pointer increment
{
    sum += ((*pH) * (*pX)); // where the DD instruction is used
} // end for()

*pY = (short int)(sum >> 15);
pY++;

R. Williams
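
For anyone who wants to try the pointer form from point 4 in isolation, here is a minimal sketch of it as a complete function. The function name fir_q15_ptr, the parameter names, and the fixed-width types are my own assumptions and are not taken from conv_cim.c. It also assumes x holds x_len + h_len - 1 samples and that the data is scaled so the 32-bit sum does not overflow. Whether the TI compiler generates better code from this form than from the indexed form is something to check in the generated assembly, not something this sketch shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Q15 FIR written with walking pointers instead of x[i + j] indexing.
 * Assumption: x holds x_len + h_len - 1 samples, so the inner loop
 * never reads past the end of the buffer.
 * Assumption: inputs are scaled so the 32-bit accumulator cannot overflow.
 */
static void fir_q15_ptr(const int16_t *x, const int16_t *h, int16_t *y,
                        size_t x_len, size_t h_len)
{
    const int16_t *const h_end = h + h_len;  /* one past the last tap */
    int16_t *py = y;
    size_t i;

    for (i = 0; i < x_len; i++) {
        const int16_t *px = x + i;  /* start of the current input window */
        const int16_t *ph = h;      /* first filter tap */
        int32_t sum = 0;

        while (ph < h_end) {
            sum += (int32_t)(*ph) * (int32_t)(*px);
            ph++;
            px++;
        }

        *py = (int16_t)(sum >> 15); /* Q15 scaling, as in the post above */
        py++;
    }
}
```

With two taps of 16384 (0.5 in Q15) and input 2, 4, 6 followed by one zero, the three outputs work out to 3, 5, 3.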

_____________________________________
Reply by Jeff Brower October 8, 2010
All-

I just got told the "less-than asterisk greater-than" thing killed some text in my last post. That thing is vicious.
Here is the text again... I used (*) instead.

Here is the convolution source (conv_cim.c):

http://groups.yahoo.com/group/c6x/attachments/folder/1393519143/item/list

Evidently Yahoo groups has a script that strips off attachments and stores them in a central place. But, one thing we
noticed is that the script adds some references to the attachment prefixed by (*) and, unfortunately, some HTML mail
clients see this as a tag and kill the references. We found this because we can only see references to attachments in
our text-based mail clients.

-Jeff


_____________________________________
Reply by Jeff Brower October 8, 2010
Richard-

Here is the convolution source (conv_cim.c):

http://groups.yahoo.com/group/c6x/attachments/folder/1393519143/item/list

Evidently Yahoo groups has a script that strips off attachments and stores them in a central place. But, one thing we

-Jeff


_____________________________________
Reply by Vikram Ragukumar October 7, 2010
Jagadeesh,

> The compiler requires the use of intrinsics to match the C64x+ _ddotpl2 instruction; please refer to the c6x.h include file for the expected prototype.
>
> I have been corresponding with Jeff directly. Starting from natural C code, which is their project's requirement, the C code can be annotated with _nasserts to give the compiler
> the information it needs to unroll the loop, for example as follows:

After incorporating #pragma MUST_ITERATE and _nassert() as you
suggested, we are seeing comparable timing figures in the generated
assembly output.

Also, we now have a better understanding of how to interpret the cycle
count information generated by the compiler.

Thanks and Regards,
Vikram.
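
For readers following the cycle numbers in this thread, here is a small sketch of the arithmetic. It is one plausible reading, not a statement of how the compiler reports cycles: it assumes that the "9" in the earlier nx * nh * 9 estimate is the cycle count of one software-pipelined kernel iteration, and that one such iteration covers 32 multiplies, as stated in the quoted note below. Prolog, epilog, and outer-loop overhead are ignored, and the function names are made up for illustration.

```c
#include <assert.h>

/* Hand-written benchmark estimate quoted in the thread: nx * nh / 8. */
static unsigned long cycles_benchmark(unsigned long nx, unsigned long nh)
{
    return nx * nh / 8UL;
}

/* First reading of the compiler output: 9 cycles per multiply. */
static unsigned long cycles_per_multiply_reading(unsigned long nx,
                                                 unsigned long nh)
{
    return nx * nh * 9UL;
}

/*
 * Second reading (assumption): each 9-cycle kernel iteration performs
 * 32 multiplies, so the total is nx * nh * 9 / 32.
 */
static unsigned long cycles_per_kernel_reading(unsigned long nx,
                                               unsigned long nh)
{
    return nx * nh * 9UL / 32UL;
}
```

For nx = 256 and nh = 16 these give 512, 36864, and 1152 cycles respectively, so under the second reading the compiler output is within a small factor of the benchmark instead of 72 times slower.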

>
> /*-------------------------------*/
> /* The following assumptions are made if NOASSUME is not defined. */
> /* It is assumed that the number of output samples >= 2. It is also */
> /* assumed that the number of output samples to be computed is a */
> /* multiple of 2. In addition it is assumed that the number of */
> /* filter taps is >= 4, and a multiple of 4. */
> /*-------------------------------*/
>
> #ifndef NOASSUME
> _nassert(nr >= 2);
> _nassert(nr % 2 == 0);
> _nassert(nh >= 4);
> _nassert(nh % 4 == 0);
> #pragma MUST_ITERATE(4,,4);
> #endif
>
> for (j = 0; j < nr; j++)
> {
>
>     /*----------------------------*/
>     /* Initialize accumulator for FIR sum to be zero. */
>     /*----------------------------*/
>
>     sum = 0;
>
>     /*----------------------------*/
>     /* The following assumptions are made if NOASSUME is not defined. */
>     /* It is assumed that the input, filter and output pointers */
>     /* are dword aligned. In addition it is assumed that the # */
>     /* of filter taps is at least 8, and a multiple of 8. */
>     /*----------------------------*/
>
> #ifndef NOASSUME
>     _nassert((int)x % 8 == 0);
>     _nassert((int)h % 8 == 0);
>     _nassert((int)r % 8 == 0);
> #pragma MUST_ITERATE(8,,8);
> #endif
>
>     /*---------------------------*/
>     /* Compute FIR as sum of products. */
>     /*---------------------------*/
>
>     for (i = 0; i < nh; i++)
>     {
>         sum += x[i + j] * h[i];
>     }
>
>     /*---------------------------*/
>     /* Shift out FIR sum and store out. */
>     /*---------------------------*/
>
>     r[j] = sum >> 15;
> }
> }
>
> When we compile with cl6x -k -o3 -mwth -mv6400 or -mv6400+ we will see 16 DOTP2's in 9 cycles,
> which achieves 32 multiplies in 9 cycles, so about 3.56 multiplies per cycle is achievable by
> annotating natural C code.
>
> C6x Compiler generated assembly:
>
> ;** --*
> $C$L4: ; PIPED LOOP KERNEL
> ; EXCLUSIVE CPU CYCLES: 9
>
>    ADD .L1 A4,A21,A21 ; |95|<0,10>
> || DOTP2 .M2X B7,A3,B6 ; |95|<0,10>
> || DOTP2 .M1X B6,A3,A4 ; |95|<0,10>
> || LDW .D2T1 *B23++,A3 ; |95|<1,1>
>
>    ADD .L1 A4,A19,A19 ; |95|<0,11>
> || ADD .L2 B8,B19,B19 ; |95|<0,11>
> || DOTP2 .M1X B6,A3,A4 ; |95|<0,11>
> || LDNDW .D1T2 *+A8(12),B7:B6 ; |95|<1,2>
>
>    DOTP2 .M2X B16,A3,B8 ; |95|<0,12>
> || ADD .L2 B6,B18,B18 ; |95|<0,12>
> || DOTP2 .M1X B9,A3,A4 ; |95|<0,12>
> || ADD .L1 A4,A9,A9 ; |95|<0,12>
> || LDNDW .D1T1 *+A8(14),A5:A4 ; |95|<1,3>
>
>    [ B0] BDEC .S2 $C$L4,B0 ; |95|<0,13>
> || DOTP2 .M2X B8,A3,B7 ; |95|<0,13>
> || DOTP2 .M1X B7,A3,A22 ; |95|<0,13>
> || ADD .L1 A4,A16,A16 ; |95|<0,13>
> || ADD .L2 B6,B21,B21 ; |95|<0,13>
> || LDNDW .D1T2 *+A8(20),B9:B8 ; |95|<1,4>
>
>    DOTP2 .M1X B17,A3,A4 ; |95|<0,14>
> || DOTP2 .M2X B9,A3,B6 ; |95|<0,14>
> || ADD .L1 A4,A7,A7 ; |95|<0,14>
> || ADD .L2 B6,B22,B22 ; |95|<0,14>
> || LDNDW .D1T2 *+A8(22),B7:B6 ; |95|<1,5>
>
>    ADD .L1 A4,A17,A17 ; |95|<0,15>
> || DOTP2 .M1 A23,A3,A4 ; |95|<1,6>
> || LDNDW .D1T2 *+A8(6),B7:B6 ; |95|<1,6>
>
>    ADD .L2 B8,B4,B4 ; |95|<0,16>
> || ADD .L1 A4,A6,A6 ; |95|<0,16>
> || LDNDW .D1T2 *-A8(2),B17:B16 ; |95|<1,7>
> || DOTP2 .M2X B7,A3,B8 ; |95|<1,7>
> || DOTP2 .M1 A22,A3,A4 ; |95|<1,7>
>
>    ADD .L2 B7,B5,B5 ; |95|<0,17>
> || ADD .L1 A22,A18,A18 ; |95|<0,17>
> || DOTP2 .M2X B6,A3,B6 ; |95|<1,8>
> || LDNDW .D1T2 *+A8(4),B9:B8 ; |95|<1,8>
> || DOTP2 .M1 A4,A3,A4 ; |95|<1,8>
>
>    ADD .L1 A4,A20,A20 ; |95|<0,18>
> || ADD .L2 B6,B20,B20 ; |95|<0,18>
> || DOTP2 .M1 A5,A3,A4 ; |95|<1,9>
> || DOTP2 .M2X B8,A3,B6 ; |95|<1,9>
> || LDNDW .D1T1 *A8++(4),A23:A22 ; |95|<2,0>
>
> Regards
> JS
>
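
The quoted fragment above omits the function header and declarations. The following is a minimal sketch of how it might be wrapped into a complete function that also builds on a non-TI host compiler. The function name fir_sketch, the parameter order, and the fixed-width types are assumptions, not TI's DSPLIB source. On non-TI compilers _nassert() is mapped to assert() so the stated assumptions are checked at run time, and the MUST_ITERATE pragmas are compiled only when _TMS320C6X is defined. The outer-loop pragma is written here as (2,,2) to agree with the stated assumption on nr, which differs from the (4,,4) in the quoted code, and the tap-count assertions use 8 to agree with the inner-loop pragma.

```c
#include <assert.h>
#include <stdint.h>

/*
 * _nassert() is a TI compiler intrinsic. On other compilers this sketch
 * maps it to assert() so the assumptions are verified instead of assumed.
 */
#ifndef _TMS320C6X
#define _nassert(cond) assert(cond)
#endif

/*
 * Natural C FIR: r[j] = (sum over i of x[i + j] * h[i]) >> 15.
 * Assumption: x holds nr + nh - 1 samples.
 * Assumption: x, h and r are 8-byte aligned, nr is a multiple of 2,
 * and nh is a multiple of 8.
 */
static void fir_sketch(const int16_t *x, const int16_t *h, int16_t *r,
                       int nh, int nr)
{
    int i, j;
    int32_t sum;

    _nassert(nr >= 2);
    _nassert(nr % 2 == 0);
    _nassert(nh >= 8);
    _nassert(nh % 8 == 0);
#ifdef _TMS320C6X
#pragma MUST_ITERATE(2,,2)
#endif
    for (j = 0; j < nr; j++) {
        sum = 0;

        _nassert((uintptr_t)x % 8 == 0);
        _nassert((uintptr_t)h % 8 == 0);
        _nassert((uintptr_t)r % 8 == 0);
#ifdef _TMS320C6X
#pragma MUST_ITERATE(8,,8)
#endif
        for (i = 0; i < nh; i++) {
            sum += x[i + j] * h[i];   /* multiply-accumulate */
        }

        r[j] = (int16_t)(sum >> 15);  /* shift out Q15 result */
    }
}
```

On a host build the arrays passed in need to be 8-byte aligned (for example with C11 _Alignas(8)), otherwise the fallback assert() fires.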

_____________________________________
Reply by Jeff Brower October 7, 2010
Richard-

> The line:
> sum += h[j] * x[i + j];
> contains: x[i + j] where the max size of the x array is 256
> where i ranges from 0 to 255
> and j ranges from 0 to 15
>
> so (as an example) when i=255 and j=1
> then the referenced address is beyond the end of the x array bounds.
>
> regarding the optimization...
>
> my first action would be to declare h_len and x_len as 'register' so no CPU
> cycles are wasted accessing the values h_len and x_len on the stack.
>
> my second action would be to declare i and j as 'register' so no CPU cycles are
> wasted accessing the values on the stack.

Thanks Richard -- sharp eyes as usual. We're not actually using such short lengths... I probably shouldn't have sent
that source. I just sent another one, realistic for our application (attached to reply to Andrew). Can you see it?

-Jeff
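
To make the bounds point concrete, here is a minimal sketch of the zero-padding approach mentioned elsewhere in this thread, using the 256 and 16 lengths from the quoted comment. The buffer name x_padded and the helper names are hypothetical and are not taken from the attached source. The largest index touched is (X_LEN - 1) + (H_LEN - 1) = 270, so a buffer of X_LEN + H_LEN - 1 = 271 elements keeps every x[i + j] access in bounds.

```c
#include <assert.h>
#include <string.h>

#define X_LEN 256   /* data length from the quoted comment */
#define H_LEN 16    /* filter length from the quoted comment */

/* Input buffer with H_LEN - 1 trailing zeros so x[i + j] stays in bounds. */
static short x_padded[X_LEN + H_LEN - 1];

/* Copy X_LEN samples in and clear the padding region. */
static void load_input(const short *src)
{
    memcpy(x_padded, src, X_LEN * sizeof(short));
    memset(x_padded + X_LEN, 0, (H_LEN - 1) * sizeof(short));
}

/* Same sum as in the quoted line, but reading from the padded buffer. */
static void convolve(const short *h, short *y)
{
    int i, j;

    for (i = 0; i < X_LEN; i++) {
        int sum = 0;

        for (j = 0; j < H_LEN; j++) {
            sum += h[j] * x_padded[i + j];  /* i + j <= 270 < 271 */
        }
        y[i] = (short)(sum >> 15);
    }
}
```

The last H_LEN - 1 outputs are computed partly from the zero padding, so whether they are meaningful depends on the application.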


_____________________________________
Reply by Jeff Brower October 7, 2010
Bhooshan-

> Since this is a newer instruction targeting C64x+ cores, you should also
> refer to 64x to 64x+ software migration documents like these for clues on how
> and whether you can get the compiler to generate code using DDOTPL2.
>
> http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf
>
> Apart from this I believe TI also used to employ register TDM on their
> benchmarks as opposed to loop-unrolling done by the compiler. My
> understanding of the compiler of those days was that the behaviour was not
> yet available thru the C compiler.
>
> It's possible TI may release an under-development compiler to you with these
> extensions, if they exist at all. Worth a shot.

Thanks Bhooshan, very good advice.

-Jeff

> On Thu, Oct 7, 2010 at 4:36 PM, Bhooshan Iyer wrote:
>
>> Jeff--
>> I suggest you write to JS @ TI to see if he has a better compiler oriented
>> approach for you. I know of certain cases where the hand-optimization cannot
>> be matched by the compiler but for most typical loops, compiler optimization
>> should take you fairly close to where you want to go. JS has always been a
>> big champion of letting the compiler work for you.
>>
>> http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf
>>
>>
>> http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf
>>
>> https://www.cosic.esat.kuleuven.be/publications/article-674.pdf
>>
>> --Bhooshan
>>
>> On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:
>>
>>>
>>> [Attachment(s) from Jeff Brower included below]
>>>
>>> Laurent-
>>>
>>>
>>> > It really seems that for the case you are talking about the compiler
>>> > intrinsics are the way to go.
>>>
>>> Yes, or even a function call to the benchmark routine (written in
>>> hand-optimized asm lang). However, we are under a
>>> project constraint to only use standard C code.
>>>
>>>
>>> > If you want/need more help would you mind sharing at least the code for
>>> one
>>> > of the loops that you "think" cannot get (using intrinsics) to the
>>> > performance level you expected?
>>>
>>> I've attached the source that we're using to test compiler optimization.
>>> It looks like the compiler can generate with
>>> a cycle count of about:
>>>
>>> nx * nh * 9
>>>
>>> and the hand-written benchmark about:
>>>
>>> nx * nh / 8
>>>
>>> where nx is data length and nh is filter length. That's a big difference,
>>> so we're trying to determine if the
>>> compiler can get closer.
>>>
>>> One note about the source: using "negative" x[] indexing produces a
>>> markedly slower result, so we're currently
>>> assuming that h[] is stored in reverse order and we would zero pad at end
>>> of x[].
>>>
>>> -Jeff
>>>
>>>
>>> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>>> >
>>> >>
>>> >>
>>> >> Andrew-
>>> >>
>>> >>
>>> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>>> >> long long _ddotpl2(long long
>>> >> src1_o:src1_e, uint src2);
>>> >>
>>> >> But it is not something that I have tried.
>>> >>
>>> >> Thanks Andrew. I've seen that... but I think with intrinsics we're
>>> still
>>> >> unable to reach the same level of performance as fir_r8, which is TI's
>>> C64x+
>>> >> benchmark routine for convolution. For one core, cycles for fir_r8 is
>>> on
>>> >> the order of:
>>> >>
>>> >> nh * nx / 8
>>> >>
>>> >> where nh is filter length and nx is data length. That appears to be
>>> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>>> >> instructions in parallel.
>>> >>
>>> >> -Jeff
>>> >>
>>> >>
>>> >> ------------------------------
>>> >> *From:* Jeff Brower
>>> >
>>> >> *To:* c...
>>> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
>>> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2
>>> instruction
>>> >>
>>> >> All-
>>> >>
>>> >> We have been unable to find a combination of C source code and compiler
>>> >> options that will cause the TI C64x+ compiler
>>> >> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find
>>> that
>>> >> surprising since super-efficient MAC has
>>> >> been a TI staple for many years.
>>> >>
>>> >> Does anyone (in particular TI persons monitoring this group) know
>>> whether
>>> >> there is a way?
>>> >>
>>> >> Also, is there an app note about writing optimized C source code newer
>>> than
>>> >> this one:
>>> >>
>>> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>> >>
>>> >> Thanks.
>>> >>
>>> >> -Jeff
>>> >>
>>> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>> >>
>>> >> --
>>>
>

_____________________________________
Reply by Andrew Elder October 7, 2010
Hi Richard,

My understanding is that there is no reason to *ever* declare a variable
"register" when writing C6000 code. The compiler is smart enough to assign the
variable to a register.

Can someone please jump in and correct me if I am wrong ?

- Andrew

________________________________
From: Richard Williams
To: Jeff Brower ; Laurent Gauthier

Cc: c...
Sent: Thu, October 7, 2010 9:10:38 AM
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction

Jeff, Gauthier,

The line:
sum += h[j] * x[i + j];
contains: x[i + j] where the max size of the x array is 256
where i ranges from 0 to 255
and j ranges from 0 to 15

so (as an example) when i=255 and j=1
then the referenced address, x[256], is beyond the end of the x array bounds.

regarding the optimization...

my first action would be to declare h_len and x_len as 'register' so no CPU
cycles are wasted accessing the values h_len and x_len on the stack.

my second action would be to declare i and j as 'register' so no CPU cycles are
wasted accessing the values on the stack.

R. Williams
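
The bound above can be checked with a little arithmetic: if the outer index runs over n_out outputs and the inner index over h_len taps, the highest element read is x[(n_out - 1) + (h_len - 1)], so x[] needs at least n_out + h_len - 1 elements (for example by zero padding the end of x[], as Jeff describes), or the outer loop has to stop earlier. A minimal sketch in plain C follows; the helper names max_x_index and conv_padded and the x_alloc parameter are hypothetical and not taken from conv_cim.c.

```c
#include <assert.h>
#include <stddef.h>

/* Highest index of x[] read by "sum += h[j] * x[i + j]" when
 * i runs over 0..n_out-1 and j runs over 0..h_len-1.
 * (Hypothetical helper, for illustration only.) */
static size_t max_x_index(size_t n_out, size_t h_len)
{
    return (n_out - 1) + (h_len - 1);
}

/* Same loop shape as in the thread, with the buffer requirement made
 * explicit: x_padded must hold at least n_out + h_len - 1 elements,
 * the tail being zero padding. x_alloc is the allocated element count. */
static void conv_padded(const short *x_padded, size_t x_alloc,
                        const short *h, size_t h_len,
                        short *y, size_t n_out)
{
    size_t i, j;

    assert(n_out > 0 && h_len > 0);
    assert(x_alloc >= n_out + h_len - 1);   /* no read past the buffer */

    for (i = 0; i < n_out; i++) {
        int sum = 0;
        for (j = 0; j < h_len; j++)
            sum += h[j] * x_padded[i + j];
        y[i] = (short)(sum >> 15);          /* Q15 scaling, as in the thread */
    }
}
```

With 256 outputs and 16 taps, max_x_index(256, 16) is 270, so a 256-element x[] is 15 elements too short unless it is padded.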

---------- Original Message -----------
From: "Jeff Brower"
To: "Laurent Gauthier"
Cc: c...
Sent: Wed, 6 Oct 2010 22:11:21 -0500 (CDT)
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction [1
Attachment]

> Laurent-
>
> > It really seems that for the case you are talking about the compiler
> > intrinsics are the way to go.
>
> Yes, or even a function call to the benchmark routine (written in hand-
> optimized asm lang). However, we are under a project constraint to
> only use standard C code.
>
> > If you want/need more help would you mind sharing at least the code for one
> > of the loops that you "think" cannot get (using intrinsics) to the
> > performance level you expected?
>
> I've attached the source that we're using to test compiler
> optimization. It looks like the compiler can generate with a cycle
> count of about:
>
> nx * nh * 9
>
> and the hand-written benchmark about:
>
> nx * nh / 8
>
> where nx is data length and nh is filter length. That's a big
> difference, so we're trying to determine if the compiler can get closer.
>
> One note about the source: using "negative" x[] indexing produces a
> markedly slower result, so we're currently assuming that h[] is stored
> in reverse order and we would zero pad at end of x[].
>
> -Jeff
>
> > On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
> >
> >>
> >>
> >> Andrew-
> >>
> >>
> >> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
> >> long long _ddotpl2(long long
> >> src1_o:src1_e, uint src2);
> >>
> >> But it is not something that I have tried.
> >>
> >> Thanks Andrew. I've seen that... but I think with intrinsics we're still
> >> unable to reach the same level of performance as fir_r8, which is TI's C64x+
> >> benchmark routine for convolution. For one core, cycles for fir_r8 is on
> >> the order of:
> >>
> >> nh * nx / 8
> >>
> >> where nh is filter length and nx is data length. That appears to be
> >> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
> >> instructions in parallel.
> >>
> >> -Jeff
> >>
> >>
> >> ------------------------------
> >> *From:* Jeff Brower
> >> *To:* c...
> >> *Sent:* Mon, October 4, 2010 7:50:03 PM
> >> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
> >>
> >> All-
> >>
> >> We have been unable to find a combination of C source code and compiler
> >> options that will cause the TI C64x+ compiler
> >> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
> >> surprising since super-efficient MAC has
> >> been a TI staple for many years.
> >>
> >> Does anyone (in particular TI persons monitoring this group) know whether
> >> there is a way?
> >>
> >> Also, is there an app note about writing optimized C source code newer than
> >> this one:
> >>
> >> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
> >>
> >> Thanks.
> >>
> >> -Jeff
> >>
> >> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
> >>
> >> --
> > Laurent Gauthier
> >
> > "They that can give up essential liberty to obtain a little temporary safety
> > deserve neither liberty nor safety."
> > --Benjamin Franklin, 1759
> >
------- End of Original Message -------
Reply by "Sankaran, Jagadeesh" October 7, 2010
The compiler requires the use of intrinsics to match the C64x+ _ddotpl2 instruction. Please refer to the c6x.h include file for the expected prototype.

I have been corresponding with Jeff directly. Starting from natural C code, which is their project's requirement, we annotate the C code with _nassert statements
to give the compiler the information it needs to unroll the loop, for example as follows:

/*-------------------------------*/
/* The following assumptions are made if NOASSUME is not defined */
/* It is assumed that the number of output samples >= 2. It is also */
/* assumed that the number of output samples to be computed is a */
/* multiple of 2. In addition it is assumed, that the number of */
/* filter taps is >= 4, and a multiple of 4. */
/*-------------------------------*/

#ifndef NOASSUME
_nassert(nr >= 2);
_nassert(nr % 2 == 0);
_nassert(nh >= 4);
_nassert(nh % 4 == 0);
#pragma MUST_ITERATE(4,,4);
#endif

for (j = 0; j < nr; j++)
{

/*----------------------------*/
/* Initialize accumulator for FIR sum to be zero. */
/*----------------------------*/

sum = 0;

/*----------------------------*/
/* The following assumptions are made if NOASSUME is not defined. */
/* It is assumed that the input, filter and output pointers */
/* are dword aligned. In addition it is assumed that the # */
/* of filter taps is at least 8, and a multiple of 8. */
/*----------------------------*/

#ifndef NOASSUME
_nassert((int)x % 8 == 0);
_nassert((int)h % 8 == 0);
_nassert((int)r % 8 == 0);
#pragma MUST_ITERATE(8,,8);
#endif

/*---------------------------*/
/* Compute FIR as sum of products. */
/*---------------------------*/

for (i = 0; i < nh; i++)
{
sum += x[i + j] * h[i];
}

/*---------------------------*/
/* Shift out FIR sum and store out. */
/*---------------------------*/

r[j] = sum >> 15;
}
}

When we compile with cl6x -k -o3 -mwth -mv6400 or -mv6400+ we see 16 DOTP2s in 9 cycles,
which is 32 multiplies in 9 cycles, so about 3.56 multiplies per cycle is achievable by
annotating natural C code.
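
To make the arithmetic explicit: each DOTP2 multiplies two pairs of signed 16-bit values and adds the two products, so 16 DOTP2s are 32 multiplies, and 32 / 9 is roughly 3.56 per cycle. The sketch below is a plain-C model of that operation plus the throughput calculation. The function names dotp2_model and multiplies_per_cycle are hypothetical, and the model is only an illustration of the arithmetic, not the compiler intrinsic.

```c
#include <stdint.h>

/* Plain-C model of one DOTP2: each 32-bit operand is treated as two
 * signed 16-bit halves; the result is lo*lo + hi*hi.
 * Assumes the usual two's-complement narrowing to int16_t, and ignores
 * the single corner case where all four halves are 0x8000. */
static int32_t dotp2_model(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);

    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Multiplies per cycle for a software-pipelined kernel that issues
 * dotp2_count DOTP2 instructions every kernel_cycles cycles. */
static double multiplies_per_cycle(int dotp2_count, int kernel_cycles)
{
    return (2.0 * dotp2_count) / kernel_cycles;
}
```

For the kernel listed in this message, multiplies_per_cycle(16, 9) evaluates to 32 / 9.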

C6x Compiler generated assembly:

;** --*
$C$L4: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 9

ADD .L1 A4,A21,A21 ; |95| <0,10>
|| DOTP2 .M2X B7,A3,B6 ; |95| <0,10>
|| DOTP2 .M1X B6,A3,A4 ; |95| <0,10>
|| LDW .D2T1 *B23++,A3 ; |95| <1,1>

ADD .L1 A4,A19,A19 ; |95| <0,11>
|| ADD .L2 B8,B19,B19 ; |95| <0,11>
|| DOTP2 .M1X B6,A3,A4 ; |95| <0,11>
|| LDNDW .D1T2 *+A8(12),B7:B6 ; |95| <1,2>

DOTP2 .M2X B16,A3,B8 ; |95| <0,12>
|| ADD .L2 B6,B18,B18 ; |95| <0,12>
|| DOTP2 .M1X B9,A3,A4 ; |95| <0,12>
|| ADD .L1 A4,A9,A9 ; |95| <0,12>
|| LDNDW .D1T1 *+A8(14),A5:A4 ; |95| <1,3>

[ B0] BDEC .S2 $C$L4,B0 ; |95| <0,13>
|| DOTP2 .M2X B8,A3,B7 ; |95| <0,13>
|| DOTP2 .M1X B7,A3,A22 ; |95| <0,13>
|| ADD .L1 A4,A16,A16 ; |95| <0,13>
|| ADD .L2 B6,B21,B21 ; |95| <0,13>
|| LDNDW .D1T2 *+A8(20),B9:B8 ; |95| <1,4>

DOTP2 .M1X B17,A3,A4 ; |95| <0,14>
|| DOTP2 .M2X B9,A3,B6 ; |95| <0,14>
|| ADD .L1 A4,A7,A7 ; |95| <0,14>
|| ADD .L2 B6,B22,B22 ; |95| <0,14>
|| LDNDW .D1T2 *+A8(22),B7:B6 ; |95| <1,5>

ADD .L1 A4,A17,A17 ; |95| <0,15>
|| DOTP2 .M1 A23,A3,A4 ; |95| <1,6>
|| LDNDW .D1T2 *+A8(6),B7:B6 ; |95| <1,6>

ADD .L2 B8,B4,B4 ; |95| <0,16>
|| ADD .L1 A4,A6,A6 ; |95| <0,16>
|| LDNDW .D1T2 *-A8(2),B17:B16 ; |95| <1,7>
|| DOTP2 .M2X B7,A3,B8 ; |95| <1,7>
|| DOTP2 .M1 A22,A3,A4 ; |95| <1,7>

ADD .L2 B7,B5,B5 ; |95| <0,17>
|| ADD .L1 A22,A18,A18 ; |95| <0,17>
|| DOTP2 .M2X B6,A3,B6 ; |95| <1,8>
|| LDNDW .D1T2 *+A8(4),B9:B8 ; |95| <1,8>
|| DOTP2 .M1 A4,A3,A4 ; |95| <1,8>

ADD .L1 A4,A20,A20 ; |95| <0,18>
|| ADD .L2 B6,B20,B20 ; |95| <0,18>
|| DOTP2 .M1 A5,A3,A4 ; |95| <1,9>
|| DOTP2 .M2X B8,A3,B6 ; |95| <1,9>
|| LDNDW .D1T1 *A8++(4),A23:A22 ; |95| <2,0>

Regards
JS

________________________________
From: c... [mailto:c...] On Behalf Of Bhooshan Iyer
Sent: Thursday, October 07, 2010 6:45 AM
To: Jeff Brower
Cc: Laurent Gauthier; c...
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction

Since this is a newer instruction targeting C64x+ cores, you should also refer to 64x to 64x+ software migration documents like these for clues on how and whether you can get the compiler to generate code using DDOTPL2.

http://focus.ti.com/lit/an/spraa84a/spraa84a.pdf

Apart from this I believe TI also used to employ register TDM on their benchmarks as opposed to loop-unrolling done by the compiler. My understanding of the compiler of those days was that the behaviour was not yet available thru the C compiler.

It's possible TI may release an under-development compiler to you with these extensions, if they exist at all. Worth a shot.

--Bhooshan
On Thu, Oct 7, 2010 at 4:36 PM, Bhooshan Iyer wrote:
Jeff--
I suggest you write to JS @ TI to see if he has a better compiler oriented approach for you. I know of certain cases where the hand-optimization cannot be matched by the compiler but for most typical loops, compiler optimization should take you fairly close to where you want to go. JS has always been a big champion of letting the compiler work for you.

http://ewh.ieee.org/soc/cas/dallas/documents/Sem-031606-Sankaran_RTV.pdf

http://www.asicfpga.com/site_upgrade/asicfpga/pds/image_pds_files/472.pdf

https://www.cosic.esat.kuleuven.be/publications/article-674.pdf

--Bhooshan
On Thu, Oct 7, 2010 at 8:41 AM, Jeff Brower wrote:

[Attachment(s) from Jeff Brower included below]

Laurent-
> It really seems that for the case you are talking about the compiler
> intrinsics are the way to go.
Yes, or even a function call to the benchmark routine (written in hand-optimized asm lang). However, we are under a
project constraint to only use standard C code.
> If you want/need more help would you mind sharing at least the code for one
> of the loops that you "think" cannot get (using intrinsics) to the
> performance level you expected?
I've attached the source that we're using to test compiler optimization. It looks like the compiler can generate with
a cycle count of about:

nx * nh * 9

and the hand-written benchmark about:

nx * nh / 8

where nx is data length and nh is filter length. That's a big difference, so we're trying to determine if the
compiler can get closer.

One note about the source: using "negative" x[] indexing produces a markedly slower result, so we're currently
assuming that h[] is stored in reverse order and we would zero pad at end of x[].

-Jeff
> On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>
>> Andrew-
>> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> long long _ddotpl2(long long
>> src1_o:src1_e, uint src2);
>>
>> But it is not something that I have tried.
>>
>> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> the order of:
>>
>> nh * nx / 8
>>
>> where nh is filter length and nx is data length. That appears to be
>> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> instructions in parallel.
>>
>> -Jeff
>> ------------------------------
>> *From:* Jeff Brower
>> *To:* c...
>> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>
>> All-
>>
>> We have been unable to find a combination of C source code and compiler
>> options that will cause the TI C64x+ compiler
>> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> surprising since super-efficient MAC has
>> been a TI staple for many years.
>>
>> Does anyone (in particular TI persons monitoring this group) know whether
>> there is a way?
>>
>> Also, is there an app note about writing optimized C source code newer than
>> this one:
>>
>> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>
>> Thanks.
>>
>> -Jeff
>>
>> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>
>> --
Reply by Andrew Elder October 7, 2010
Jeff,

Are you able to make any assumptions about h_len, x_len and array alignment in
memory?

TI advises that you use
#pragma MUST_ITERATE
to support compiler loop unrolling
and
_nassert(((int)x & 0x7) == 0);
to tell the compiler that x is double-word (8-byte) aligned, so it can use LDDW instructions.

Of course this is no longer standard C !

- Andrew
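
One way to keep such hints from breaking a build with other compilers is to guard them behind the TI compiler's predefined macro. The sketch below assumes that __TI_COMPILER_VERSION__ is defined only by TI's tools and falls back to a standard assert() elsewhere. The ASSUME macro and the dot_q15 function are hypothetical names, and the uintptr_t cast is used in place of the (int) cast from the thread so the check also works on a 64-bit host. The loop body is still plain C; the hints just vanish for other compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: a compiler hint on TI tools, a run-time check elsewhere. */
#ifdef __TI_COMPILER_VERSION__
#define ASSUME(cond) _nassert(cond)
#else
#define ASSUME(cond) assert(cond)
#endif

/* Q15 dot product of n samples. The caller promises 8-byte aligned
 * pointers and a trip count that is at least 8 and a multiple of 8. */
static int dot_q15(const short *x, const short *h, int n)
{
    int i;
    int sum = 0;

    ASSUME(((uintptr_t)x & 0x7) == 0);   /* double-word aligned input  */
    ASSUME(((uintptr_t)h & 0x7) == 0);   /* double-word aligned filter */
    ASSUME(n >= 8 && n % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, , 8)
#endif
    for (i = 0; i < n; i++)
        sum += x[i] * h[i];

    return sum >> 15;
}
```

On a host compiler the assert() calls catch a caller that breaks the alignment or trip-count promise, which the TI hints would silently trust.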

________________________________
From: Jeff Brower
To: Laurent Gauthier
Cc: c...
Sent: Wed, October 6, 2010 11:11:21 PM
Subject: Re: [c6x] efficient C64x+ code generation and DDOTPL2 instruction [1
Attachment]

[Attachment(s) from Jeff Brower included below]
Laurent-

> It really seems that for the case you are talking about the compiler
> intrinsics are the way to go.

Yes, or even a function call to the benchmark routine (written in hand-optimized
asm lang). However, we are under a
project constraint to only use standard C code.

> If you want/need more help would you mind sharing at least the code for one
> of the loops that you "think" cannot get (using intrinsics) to the
> performance level you expected?

I've attached the source that we're using to test compiler optimization. It
looks like the compiler can generate with
a cycle count of about:

nx * nh * 9

and the hand-written benchmark about:

nx * nh / 8

where nx is data length and nh is filter length. That's a big difference, so
we're trying to determine if the
compiler can get closer.

One note about the source: using "negative" x[] indexing produces a markedly
slower result, so we're currently
assuming that h[] is stored in reverse order and we would zero pad at end of
x[].

-Jeff

> On Tue, Oct 5, 2010 at 4:34 AM, Jeff Brower wrote:
>
>> Andrew-
>> pg 2-25 of spru198i.pdf mentions Compiler Intrinsic
>> long long _ddotpl2(long long
>> src1_o:src1_e, uint src2);
>>
>> But it is not something that I have tried.
>>
>> Thanks Andrew. I've seen that... but I think with intrinsics we're still
>> unable to reach the same level of performance as fir_r8, which is TI's C64x+
>> benchmark routine for convolution. For one core, cycles for fir_r8 is on
>> the order of:
>>
>> nh * nx / 8
>>
>> where nh is filter length and nx is data length. That appears to be
>> achieved by a "few" DDOTPL2s in parallel, plus other groups of various
>> instructions in parallel.
>>
>> -Jeff
>> ------------------------------
>> *From:* Jeff Brower
>> *To:* c...
>> *Sent:* Mon, October 4, 2010 7:50:03 PM
>> *Subject:* [c6x] efficient C64x+ code generation and DDOTPL2 instruction
>>
>> All-
>>
>> We have been unable to find a combination of C source code and compiler
>> options that will cause the TI C64x+ compiler
>> to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that
>> surprising since super-efficient MAC has
>> been a TI staple for many years.
>>
>> Does anyone (in particular TI persons monitoring this group) know whether
>> there is a way?
>>
>> Also, is there an app note about writing optimized C source code newer than
>> this one:
>>
>> http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf
>>
>> Thanks.
>>
>> -Jeff
>>
>> PS. We're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.
>>
>> --
> Laurent Gauthier
>
> "They that can give up essential liberty to obtain a little temporary safety
> deserve neither liberty nor safety."
> --Benjamin Franklin, 1759
>