DSPRelated.com
Forums

About touch assembly routine for C64x

Started by unic...@yahoo.com January 28, 2011
Hi, all,
The following code snippet is the C64x touch assembly routine, which is available in the TI document spru656a.

I don't know why we should choose "31" in the 4th line,
|| ADDAW .D2 B4, 31, B4 ; Round up # of iters

Why not select "32"? Let's take a look at an example.
Suppose we have an array, short x[65], whose length is 130 bytes.
Assume that x[0] is located at 0x0000 0000, so the array occupies 3 cache lines. After the instruction || SHR .S1X B4, 7, A1, A1 equals 1. Therefore, only elements x[0] to x[63] are touched, while element x[64] is not. However, if we choose "32", then all elements of the array are touched.

Can anybody tell me the reason for choosing "31", or point out the mistake in the example above? Any hints are welcome. Many thanks.

regards,
Yi
;------
_touch
B .S2 loop ; Pipe up the loop
|| MVK .S1 128, A2 ; Step by two cache lines
|| ADDAW .D2 B4, 31, B4 ; Round up # of iters

B .S2 loop ; Pipe up the loop
|| CLR .S1 A4, 0, 6, A4 ; Align to cache line
|| MV .L2X A4, B0 ; Twin the pointer

B .S1 loop ; Pipe up the loop
|| CLR .S2 B0, 0, 6, B0 ; Align to cache line
|| MV .L2X A2, B2 ; Twin the stepping constant

B .S2 loop ; Pipe up the loop
|| SHR .S1X B4, 7, A1 ; Divide by 128 bytes
|| ADDAW .D2 B0, 17, B0 ; Offset by one line + one word

[A1] BDEC .S1 loop, A1 ; Step by 128s through array
|| [A1] LDBU .D1T1 *A4++[A2], A3 ; Load from [128*i + 0]
|| [A1] LDBU .D2T2 *B0++[B2], B4 ; Load from [128*i + 68]
|| SUB .L1 A1, 7, A0

loop:
[A0] BDEC .S1 loop, A0 ; Step by 128s through array
|| [A1] LDBU .D1T1 *A4++[A2], A3 ; Load from [128*i + 0]
|| [A1] LDBU .D2T2 *B0++[B2], B4 ; Load from [128*i + 68]
|| [A1] SUB .L1 A1, 1, A1

BNOP .S2 B3, 5 ; Return
;--------------------------

_____________________________________
Yi-

> The following code snippet is the C64x touch assembly routine, which is available in the TI document spru656a.
>
> I don't know why we should choose "31" in the 4th line,
> || ADDAW .D2 B4, 31, B4 ; Round up # of iters

The comment says "round up", so my guess is there is some type of calculation such as this going on:

num_buffers = (num_bytes + buf_len - 1) / buf_len

For example if buf_len is 32 and num_bytes is 31, then num_buffers = 1 only after rounding up.
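
To make the round-up idea concrete, here is a minimal C sketch of that calculation. The helper name `round_up_div` is hypothetical and not part of the TI code; it only illustrates the formula above for non-negative integers.

```c
#include <assert.h>

/* Hypothetical helper (not from the TI source): number of buf_len-sized
 * buffers needed to hold num_bytes, counting any partial buffer as one. */
static int round_up_div(int num_bytes, int buf_len)
{
    assert(num_bytes >= 0 && buf_len > 0);
    /* Adding buf_len - 1 before the truncating divide rounds up. */
    return (num_bytes + buf_len - 1) / buf_len;
}
```

With buf_len = 32, round_up_div(31, 32) and round_up_div(32, 32) both give 1, while round_up_div(33, 32) gives 2.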

What is in B4 upon entry? Your post doesn't show that. It looks to me like B4 should be the number of 32-bit values,
but your x[] array is declared as short int. If touch() takes a size param, maybe you have to pass

(sizeof(x)+sizeof(int)-1)/sizeof(int)

in order to pass number of 32-bit values.

-Jeff

_____________________________________
Hi, Jeff,
Thanks for your message. Here is some more information about the touch routine.

According to the following description, B4 holds the argument "length". The L1D cache line size is 64 bytes.

In my example, the array occupies 3 lines
Line1: 0,1,2,...,63
Line2: 64,65,66,...,127
Line3: 128,129,...

ADDAW .D2 B4, 31, B4 ----> 130 + 31*4 = 254
...
SHR .S1X B4, 7, A1 ----> 254/128 = 1

----> only 2 cache lines are touched.
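
The arithmetic above can be checked with a small C model. This is only a sketch of the two instructions in question (ADDAW adds a word offset, i.e. 4 bytes per unit, and SHR by 7 divides by 128); the function name `touch_count` and its parameters are made up for illustration, and the loop that follows in the routine is not modeled.

```c
#include <assert.h>

/* Illustrative model of "ADDAW B4, words, B4" followed by
 * "SHR B4, 7, A1". Not part of the TI source. */
static int touch_count(int length_bytes, int words)
{
    assert(length_bytes >= 0 && words >= 0);
    int rounded = length_bytes + words * 4; /* ADDAW: add words*4 bytes */
    return rounded >> 7;                    /* SHR: divide by 128 */
}
```

For length 130, touch_count(130, 31) is 1 (254 >> 7), whereas touch_count(130, 32) is 2 (258 >> 7).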

regards,
Yi

* USAGE *
* This routine is C callable, and has the following C prototype: *
* *
* void touch *
* ( *
* const void *array, /* Pointer to array to touch */ *
* int length /* Length array in bytes */ *
* ); *
* *
* This routine returns no value and discards the loaded data. *
* *
* DESCRIPTION *
* The touch() routine brings an array into the cache by reading *
* elements spaced one cacheline apart in a tight loop. This *
* causes the array to be read into the cache, despite the fact *
* that the data being read is discarded. If the data is already *
* present in the cache, the code has no visible effect. *
* *
* When touching the array, the pointer is first aligned to a *
* cache-line boundary, and the size of the array is rounded up to *
* the next multiple of two cache lines. The array is touched with *
* two parallel accesses that are spaced one cache-line and one *
* bank apart. A multiple of two cache lines is always touched. *

_____________________________________
Yi-

> Thanks for your message. Now I provide some more
> information about the touch routine.
>
> According to the following description, we know that B4
> is the argument "length". The L1D cache line size is
> 64 bytes.
>
> In my example, the array occupies 3 lines
> Line1: 0,1,2,...,63
> Line2: 64,65,66,...,127
> Line3: 128,129,...
>
> ADDAW .D2 B4, 31, B4 ----> 130 + 31*4 = 254
> ...
> SHR .S1X B4, 7, A1 ----> 254/128 = 1
>
> ----> only 2 cache lines are touched.

OK, so the "length" argument is in bytes, the cache line length is 64 bytes, and the comments say "A multiple of two cache
lines is always touched". In that case, the basic calculation for the number of lines to touch would be:

num_lines = (length + 63)/64

and to get multiple-of-two:

num_lines_m = 2*((num_lines+1)/2) /* must perform /2 and *2 in sequence */

In your example, with length = 130, num_lines = 3 and so num_lines_m = 4. That seems to be what you want. As a check, if length = 128,
then num_lines = 2 and num_lines_m = 2.
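
The two formulas above can be written out in C as a quick sanity check. This is a sketch only: the names `LINE_BYTES`, `num_lines` and `num_lines_even` are illustrative, and the 64-byte line size is the value stated earlier in the thread.

```c
#include <assert.h>

#define LINE_BYTES 64 /* L1D line size given in the thread */

/* Number of cache lines needed to cover length bytes, rounded up. */
static int num_lines(int length)
{
    assert(length >= 0);
    return (length + LINE_BYTES - 1) / LINE_BYTES;
}

/* Same count rounded up to a multiple of two lines. The divide must
 * happen before the multiply so the truncation takes effect. */
static int num_lines_even(int length)
{
    return 2 * ((num_lines(length) + 1) / 2);
}
```

For length = 130 this gives num_lines = 3 and num_lines_even = 4; for length = 128 both are 2.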

Maybe you could adjust the asm instructions and try this? If it works, then the conclusion would be a bug in the
touch() code. It wouldn't be the first time there is a bug in some TI code :-)

-Jeff

_____________________________________
Thanks a lot, Jeff. I will give it a try.

Touch is a very useful routine. I don't know why nobody else has had the same doubt.
Maybe I need more time to understand it. :-)

_____________________________________