DSPRelated.com
Forums

Image Rotate Optimize

Started by shen...@gmail.com December 1, 2010
Hi, there!

I am trying to rotate a YUV image 90 degrees clockwise, but the efficiency is not good enough. Please help me...

Here is the input image data:
1st row:y00 y01 u00 u01 y02 y03 u02 u03 ....
2nd row:y10 y11 v10 v11 y12 y13 v12 v13 ....
3rd row:y20 y21 u20 u21 y22 y23 u22 u23 ....
4th row:y30 y31 v30 v31 y32 y33 v32 v33 ....
... ...

u00 and y00 correspond to the first pixel, u01 and y01 correspond to the next pixel, and so on. Adjacent rows are stored contiguously in RAM.

What I want to do is rotate the image 90 degrees clockwise and separate Y, U, V into planar YUV422 format. After rotation, the 1st column of the original image should become the 1st row, and the 1st row of the original image should become the last column. The data should look like this:

Y:
1st row: ... ... y30 y20 y10 y00
2nd row: ... ... y31 y21 y11 y01
3rd row: ... ... y32 y22 y12 y02
4th row: ... ... y33 y23 y13 y03
... ...
The length of a row is the same as the height of the original image.
The number of rows is the same as the width of the original image.

U:
1st row: ... ... u60 u40 u20 u00
2nd row: ... ... u61 u41 u21 u01
... ...
The length of a row is half the height of the original image.
The number of rows is the same as the width of the original image.

V:
1st row: ... ... v70 v50 v30 v10
2nd row: ... ... v71 v51 v31 v11
... ...
The length of a row is half the height of the original image.
The number of rows is the same as the width of the original image.

I'm using little endian, a DM642 at 720 MHz, and 133 MHz SDRAM.

Here is my code:
int yyuuyyvv2planaryuvRoate90(char * pInBuf, char * pOutBuf, const int nWidth, const int nHeight, int nRightLeft)
{
    int i, j;
#if 0 //original code, without optimization
    char *restrict pSrc, *restrict pY, *restrict pU, *restrict pV;
    pSrc = pInBuf;
    //the last column of the first row after rotation
    pY = pOutBuf + nHeight - 1;
    pU = pOutBuf + nHeight * nWidth + (nHeight >> 1) - 1;
    pV = pOutBuf + nHeight * nWidth + (nHeight * nWidth >> 1) + (nHeight >> 1) - 1;
    for(i = 0; i < (nWidth << 1); i += 4)
    {
        pSrc = pInBuf + i;
        for(j = 0; j < nHeight; j += 2)
        {
            *pY = *pSrc;                          //y0
            *(pY + nHeight) = *(pSrc + 1);        //y1
            *pU = *(pSrc + 2);
            *(pU + (nHeight >> 1)) = *(pSrc + 3);
            pSrc += (nWidth << 1);
            *(pY - 1) = *pSrc;
            *(pY + nHeight - 1) = *(pSrc + 1);
            *pV = *(pSrc + 2);
            *(pV + (nHeight >> 1)) = *(pSrc + 3);
            pY -= 2;
            pU--;
            pV--;
            pSrc += (nWidth << 1);
        }
        pY += (3 * nHeight);
        pU += 3 * (nHeight >> 1);
        pV += 3 * (nHeight >> 1);
    }
#else //Optimized code
    //Src, rotate 90 degrees clockwise
    /* little endian
       u1u0y1y0 ...
       v1v0y3y2 ...
       u3u2y5y4 ...
       v3v2y7y6 ...
       ... ...
    */

    //src pointers
    char *restrict pSrcRow0, *restrict pSrcRow1, *restrict pSrcRow2, *restrict pSrcRow3;
    //dst pointers
    char *restrict pY0, *restrict pY1;
    char *restrict pU0, *restrict pU1;
    char *restrict pV0, *restrict pV1;
    //temporary variables
    unsigned int nY0, nY1;
    unsigned int nPixel0, nPixel1, nPixel2, nPixel3;
    int nYuvWidth = (nWidth << 1);
    int nYRowStep = 3 * nHeight;
    int nYColumnStep = nYuvWidth * 4;
    int nUVStep = 3 * (nHeight >> 1);
    //init pointers
    pY0 = pOutBuf + nHeight - 4;              //last column of the first row
    pY1 = pOutBuf + 2 * nHeight - 4;          //same column but the next row of pY0
    pU0 = pOutBuf + nHeight * nWidth + (nHeight >> 1) - 1; //last column of the first row
    pU1 = pU0 + (nHeight >> 1);               //same column but the next row of pU0
    pV0 = pOutBuf + nHeight * nWidth + (nHeight * nWidth >> 1) + (nHeight >> 1) - 1; //last column of the first row
    pV1 = pV0 + (nHeight >> 1);               //same column but the next row of pV0
    #pragma MUST_ITERATE(FF_WIDTH/2,FF_WIDTH/2,16);
    for(i = 0; i < nYuvWidth; i += 4)
    {
        pSrcRow0 = pInBuf + i;                //first row
        pSrcRow1 = pSrcRow0 + nYuvWidth;      //second row
        pSrcRow2 = pSrcRow1 + nYuvWidth;      //3rd row
        pSrcRow3 = pSrcRow2 + nYuvWidth;      //4th row
        #pragma MUST_ITERATE(FF_HEIGHT/2,FF_HEIGHT/2,8);
        for(j = 0; j < nHeight; j += 4)
        {
            nPixel0 = _mem4(pSrcRow0);        //u1u0y1y0
            nPixel1 = _mem4(pSrcRow1);        //v1v0y3y2
            nPixel2 = _mem4(pSrcRow2);        //u3u2y5y4
            nPixel3 = _mem4(pSrcRow3);        //v3v2y7y6
            nY0 = _pack2(nPixel0, nPixel1);   //y1y0y3y2
            nY1 = _pack2(nPixel2, nPixel3);   //y5y4y7y6
            _mem4(pY0) = _packl4(nY0, nY1);   //y0y2y4y6, row1 after rotate, little endian
            _mem4(pY1) = _packh4(nY0, nY1);   //y1y3y5y7, row2 after rotate, little endian

            *pU0-- = *(pSrcRow0 + 2);
            *pU0-- = *(pSrcRow2 + 2);
            *pU1-- = *(pSrcRow0 + 3);
            *pU1-- = *(pSrcRow2 + 3);
            *pV0-- = *(pSrcRow1 + 2);
            *pV0-- = *(pSrcRow3 + 2);
            *pV1-- = *(pSrcRow1 + 3);
            *pV1-- = *(pSrcRow3 + 3);
            pSrcRow0 += nYColumnStep;         //jump 4 rows
            pSrcRow1 += nYColumnStep;
            pSrcRow2 += nYColumnStep;
            pSrcRow3 += nYColumnStep;
            pY0 -= 4;                         //go to previous column
            pY1 -= 4;
        }
        pY0 += nYRowStep;                     //go to the last column of the 3rd row below the current row
        pY1 += nYRowStep;
        pU0 += nUVStep;
        pU1 += nUVStep;
        pV0 += nUVStep;
        pV1 += nUVStep;
    }
#endif
    return 0;
}
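
In case it helps to follow the byte shuffle in the inner loop, here is a minimal host-side sketch of what the intrinsics do, using my own stand-in functions for _pack2/_packl4/_packh4 (written from the documented PACK2/PACKL4/PACKH4 behavior) and made-up example byte values, so the rearrangement can be checked off-target:

#include <stdint.h>
#include <stdio.h>

/* Host-side stand-ins for the C64x intrinsics (my own helpers): */
static uint32_t pack2(uint32_t a, uint32_t b)   /* low halfword of a : low halfword of b */
{ return (a << 16) | (b & 0xFFFFu); }
static uint32_t packl4(uint32_t a, uint32_t b)  /* low byte of each halfword */
{ return ((a >> 16 & 0xFFu) << 24) | ((a & 0xFFu) << 16) | ((b >> 16 & 0xFFu) << 8) | (b & 0xFFu); }
static uint32_t packh4(uint32_t a, uint32_t b)  /* high byte of each halfword */
{ return ((a >> 24) << 24) | ((a >> 8 & 0xFFu) << 16) | ((b >> 24) << 8) | (b >> 8 & 0xFFu); }

int main(void)
{
    /* words loaded from rows 0..3 at the same column (little endian),
       using 0xA_ for u samples, 0xB_ for v samples, low values for y */
    uint32_t nPixel0 = 0xA1A00100;  /* u01 u00 y01 y00 */
    uint32_t nPixel1 = 0xB1B01110;  /* v01 v00 y11 y10 */
    uint32_t nPixel2 = 0xA3A22120;  /* u21 u20 y21 y20 */
    uint32_t nPixel3 = 0xB3B23130;  /* v21 v20 y31 y30 */

    uint32_t nY0 = pack2(nPixel0, nPixel1);     /* y01 y00 y11 y10 */
    uint32_t nY1 = pack2(nPixel2, nPixel3);     /* y21 y20 y31 y30 */

    /* prints 00102030: stored little endian that is y30 y20 y10 y00 -> output row 1 */
    printf("%08X\n", (unsigned)packl4(nY0, nY1));
    /* prints 01112131: stored little endian that is y31 y21 y11 y01 -> output row 2 */
    printf("%08X\n", (unsigned)packh4(nY0, nY1));
    return 0;
}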

I have little experience with C64x optimization; please tell me what I should do for further optimization.

Thanks very much!

Sincerely yours,
Eric Sun

_____________________________________
shengkai,

Add a couple of '_nassert' statements at the beginning of the function to
assure the data is appropriately aligned in memory (the data should be aligned
on an 8-byte boundary).

Add a couple of '_nassert' statements at the beginning of the function to
assure the data size is a multiple of 8.

Add an 'UNROLL' pragma at the top of each loop.

Add a 'restrict' modifier to the image pointers in the passed parameters.

Add a 'register' modifier to the local/automatic image pointer definitions.

Your code seems to assume that the incoming image will contain a number of
rows that is a multiple of 4. Add a '_nassert' statement that assures this is
a fact.

Add the appropriate statements so the DSP internal loop buffer will be used
(this may require modifying the code to have 3 loops rather than just one).

The FF_WIDTH and FF_HEIGHT values used in the 'MUST_ITERATE' pragmas seem
unrelated to the actual image size. This will result in problems when the image
size does not match the FF_WIDTH and FF_HEIGHT values (see the sketch below).
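
Something along these lines is what I have in mind (a rough sketch only: the function name and the trivial byte-transpose body are placeholders, and the multiples of 8 are assumptions; use whatever your capture hardware actually guarantees, and tie the MUST_ITERATE bounds to those guarantees rather than to FF_WIDTH/FF_HEIGHT):

void rotate_hints_sketch(const unsigned char *restrict pInBuf,
                         unsigned char *restrict pOutBuf,
                         int nWidth, int nHeight)
{
    int i, j;

    /* one-time promises to the compiler about the passed-in parameters */
    _nassert((int)pInBuf % 8 == 0);    /* input buffer 8-byte aligned     */
    _nassert((int)pOutBuf % 8 == 0);   /* output buffer 8-byte aligned    */
    _nassert(nWidth % 8 == 0);         /* width is a multiple of 8        */
    _nassert(nHeight % 8 == 0);        /* height is a multiple of 8       */

    #pragma MUST_ITERATE(8, , 8)       /* at least 8 trips, multiple of 8 */
    #pragma UNROLL(2)
    for (i = 0; i < nWidth; i++)
    {
        #pragma MUST_ITERATE(8, , 8)
        #pragma UNROLL(2)
        for (j = 0; j < nHeight; j++)
        {
            /* stand-in body: a plain byte transpose, just to show where
               the asserts and pragmas go -- not your YUV rotation */
            pOutBuf[i * nHeight + j] = pInBuf[j * nWidth + i];
        }
    }
}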

R. Williams


_____________________________________
Thanks for your reply, Williams!

I have modified the code according to what you said, but the performance is still the same. I checked the asm code, and the bottleneck may lie in data fetching. Here is the modified code:

#define ALIGNED_ARRAY8(ptr) _nassert((int)ptr % 8 == 0)
#define ALIGNED_ARRAY4(ptr) _nassert((int)ptr % 4 == 0)

int VixEye_format_yyuuyyvv2planaryuvRoate90(char *restrict pInBuf, char *restrict pOutBuf, const int nWidth, const int nHeight, int nRightLeft)
{
    int i, j;

    //src pointers
    register char *restrict pSrcRow0 = pInBuf;
    register char *restrict pSrcRow1 = pInBuf;
    register char *restrict pSrcRow2 = pInBuf;
    register char *restrict pSrcRow3 = pInBuf;
    register char *restrict pSrcRow4 = pInBuf;
    register char *restrict pSrcRow5 = pInBuf;
    register char *restrict pSrcRow6 = pInBuf;
    register char *restrict pSrcRow7 = pInBuf;

    //dst pointers
    register char *restrict pY0;
    register char *restrict pU0;
    register char *restrict pV0;
    register unsigned int nOffset = nHeight;
    register unsigned int nPixel0, nPixel1, nPixel2, nPixel3, nPixel4, nPixel5, nPixel6, nPixel7;
    register unsigned int nTemp0, nTemp1;
    register int nYuvWidth = (nWidth << 1);

    pY0 = pOutBuf + nHeight - 4;
    pU0 = pOutBuf + nWidth * nHeight + (nHeight >> 1) - 4;
    pV0 = pOutBuf + nWidth * nHeight + (nWidth * nHeight >> 1) + (nHeight >> 1) - 4;

    ALIGNED_ARRAY8(pInBuf);
    ALIGNED_ARRAY8(pOutBuf);
    ALIGNED_ARRAY4(pSrcRow0);
    ALIGNED_ARRAY4(pSrcRow1);
    ALIGNED_ARRAY4(pSrcRow2);
    ALIGNED_ARRAY4(pSrcRow3);
    ALIGNED_ARRAY4(pY0);
    ALIGNED_ARRAY4(pU0);
    ALIGNED_ARRAY4(pV0);
    //#pragma MUST_ITERATE(AOI_WIDTH/4,FF_WIDTH/4,16); //AOI_WIDTH = 752, FF_WIDTH = 1600
    #pragma UNROLL(8);
    for(i = 0; i < nYuvWidth; i += 4)
    {
        //src points to the next 4 columns
        pSrcRow0 = pInBuf + i;               //first row
        pSrcRow1 = pSrcRow0 + nYuvWidth;     //second row
        pSrcRow2 = pSrcRow1 + nYuvWidth;     //3rd row
        pSrcRow3 = pSrcRow2 + nYuvWidth;     //4th row
        pSrcRow4 = pSrcRow3 + nYuvWidth;     //5th row
        pSrcRow5 = pSrcRow4 + nYuvWidth;     //6th row
        pSrcRow6 = pSrcRow5 + nYuvWidth;     //7th row
        pSrcRow7 = pSrcRow6 + nYuvWidth;     //8th row
        ALIGNED_ARRAY4(pSrcRow0);
        ALIGNED_ARRAY4(pSrcRow1);
        ALIGNED_ARRAY4(pSrcRow2);
        ALIGNED_ARRAY4(pSrcRow3);
        ALIGNED_ARRAY4(pSrcRow4);
        ALIGNED_ARRAY4(pSrcRow5);
        ALIGNED_ARRAY4(pSrcRow6);
        ALIGNED_ARRAY4(pSrcRow7);
        ALIGNED_ARRAY4(pY0);
        ALIGNED_ARRAY4(pU0);
        ALIGNED_ARRAY4(pV0);
        //#pragma MUST_ITERATE(AOI_HEIGHT/8,FF_HEIGHT/8,2); //AOI_HEIGHT = 480, FF_HEIGHT = 1200
        #pragma UNROLL(2);
        for(j = 0; j < nHeight; j += 8)
        {
            nPixel0 = _mem4(pSrcRow0);               //u1u0y1y0
            nPixel1 = _mem4(pSrcRow1);               //v1v0y3y2
            nOffset = nHeight;                       //length of a Y row after rotation
            nTemp0 = _pack2(nPixel0, nPixel1);       //y1y0y3y2

            nPixel2 = _mem4(pSrcRow2);               //u3u2y5y4
            nPixel3 = _mem4(pSrcRow3);               //v3v2y7y6
            nTemp1 = _pack2(nPixel2, nPixel3);       //y5y4y7y6
            _mem4(pY0) = _packl4(nTemp0, nTemp1);            //y0y2y4y6, row1 after rotate, little endian
            _mem4(pY0 + nOffset) = _packh4(nTemp0, nTemp1);  //y1y3y5y7, row2 after rotate, little endian

            pY0 -= 4;
            nPixel4 = _mem4(pSrcRow4);               //u5u4y9y8
            nPixel5 = _mem4(pSrcRow5);               //v5v4y11y10
            nTemp0 = _pack2(nPixel4, nPixel5);       //y9y8y11y10
            nPixel6 = _mem4(pSrcRow6);               //u7u6y13y12
            nPixel7 = _mem4(pSrcRow7);               //v7v6y15y14
            nTemp1 = _pack2(nPixel6, nPixel7);       //y13y12y15y14
            _mem4(pY0) = _packl4(nTemp0, nTemp1);            //y8y10y12y14, row1 after rotate, little endian
            _mem4(pY0 + nOffset) = _packh4(nTemp0, nTemp1);  //y9y11y13y15, row2 after rotate, little endian

            nOffset >>= 1;                           //divided by 2, because the width of U, V is 1/2 of Y
            nTemp0 = _packh2(nPixel0, nPixel2);      //u1u0u3u2
            nTemp1 = _packh2(nPixel4, nPixel6);      //u5u4u7u6
            _mem4(pU0) = _packl4(nTemp0, nTemp1);            //u0u2u4u6, row1 after rotate, little endian
            _mem4(pU0 + nOffset) = _packh4(nTemp0, nTemp1);  //u1u3u5u7, row2 after rotate, little endian

            nTemp0 = _packh2(nPixel1, nPixel3);      //v1v0v3v2
            nTemp1 = _packh2(nPixel5, nPixel7);      //v5v4v7v6
            _mem4(pV0) = _packl4(nTemp0, nTemp1);            //v0v2v4v6
            _mem4(pV0 + nOffset) = _packh4(nTemp0, nTemp1);  //v1v3v5v7

            nOffset = nYuvWidth << 3;
            pSrcRow0 += nOffset;                     //jump 8 rows
            pSrcRow1 += nOffset;
            pSrcRow2 += nOffset;
            pSrcRow3 += nOffset;
            pSrcRow5 += nOffset;
            pSrcRow6 += nOffset;
            pSrcRow7 += nOffset;
            pSrcRow4 += nOffset;
            pY0 -= 4;                                //go to previous column
            pU0 -= 4;
            pV0 -= 4;
        }

        //Dst points to next 2 rows, and the last column
        pY0 += 3 * nHeight;
        pU0 += (3 * nHeight >> 1);
        pV0 += (3 * nHeight >> 1);
    }

    return 0;
}
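
For reference, this is roughly how the function gets called here (the buffer names and the DATA_ALIGN placement are just illustrative; 752x480 is the AOI size from the comments above, and for 4:2:2 data both the packed input and the planar output occupy nWidth * nHeight * 2 bytes):

#define AOI_WIDTH  752
#define AOI_HEIGHT 480

#pragma DATA_ALIGN(g_inBuf, 8)
#pragma DATA_ALIGN(g_outBuf, 8)
static char g_inBuf[AOI_WIDTH * AOI_HEIGHT * 2];   /* packed yyuu/yyvv frame   */
static char g_outBuf[AOI_WIDTH * AOI_HEIGHT * 2];  /* planar Y, then U, then V */

void rotate_frame(void)
{
    /* ... capture driver fills g_inBuf ... */
    VixEye_format_yyuuyyvv2planaryuvRoate90(g_inBuf, g_outBuf,
                                            AOI_WIDTH, AOI_HEIGHT, 0);
    /* g_outBuf now holds Y as AOI_WIDTH rows of AOI_HEIGHT bytes,
       followed by U and V as AOI_WIDTH rows of AOI_HEIGHT/2 bytes each */
}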

The following is the generated asm code.
Is there anything I can do to improve the efficiency?
Thanks!!

;******************************************************************************
;* FUNCTION NAME: yyuuyyvv2planaryuvRoate90 *
;* *
;* Regs Modified : A0,A1,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B4,B5,B6,B7,B8, *
;* B9,A16,A17,A18,A19,A20,A21,A22,A23,A24,A25,A26, *
;* A27,A28,A29,A30,A31,B16,B17,B18,B19,B20,B21,B22, *
;* B23,B24,B25,B26,B27,B30,B31 *
;* Regs Used : A0,A1,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B3,B4,B5,B6,B7, *
;* B8,B9,A16,A17,A18,A19,A20,A21,A22,A23,A24,A25, *
;* A26,A27,A28,A29,A30,A31,B16,B17,B18,B19,B20,B21, *
;* B22,B23,B24,B25,B26,B27,B30,B31 *
;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
;******************************************************************************
_VixEye_format_yyuuyyvv2planaryuvRoate90:
;** --*
; EXCLUSIVE CPU CYCLES: 14
;** 8 ----------------------- pInBuf = pInBuf;
;** 8 ----------------------- pOutBuf = pOutBuf;
;** 132 ----------------------- nYuvWidth = nWidth*2;
;** 134 ----------------------- pY0 = &pOutBuf[nHeight-4];
;** 135 ----------------------- C$6 = nWidth*nHeight;
;** 135 ----------------------- C$5 = nHeight>>1;
;** 135 ----------------------- pU0 = &pOutBuf[C$5+C$6-4];
;** 136 ----------------------- pV0 = &pOutBuf[(C$6>>1)+C$5+C$6-4];
;** 149 ----------------------- if ( nYuvWidth <= 0 ) goto g10;
;** ----------------------- U$34 = pInBuf;
;** ----------------------- K$185 = _lo(_mpyli(3, nHeight));
;** ----------------------- K$187 = K$185>>1;
;** 152 ----------------------- L$1 = nYuvWidth+3>>2;
;** ----------------------- #pragma MUST_ITERATE(1, 536870911, 1)
;** ----------------------- #pragma UNROLL(8)
;** ----------------------- #pragma LOOP_FLAGS(4096u)
MV .L1 A6,A3 ; |8|
|| MV .L2 B6,B26 ; |8|
|| ADD .S1 A6,A6,A30 ; |132|
|| MV .D1 A4,A28 ; |8|
CMPGT .L1 A30,0,A0 ; |149|
MPYLH .M2X B26,A3,B5 ; |135|
MPYLH .M1X A3,B26,A7 ; |135|
MPYU .M1X B26,A3,A5 ; |135|
ADD .L1X B5,A7,A3 ; |135|
|| MVK .L2 3,B5
MPYLI .M2 B5,B26,B9:B8
|| SHL .S1 A3,16,A3 ; |135|
|| SHR .S2 B26,1,B5 ; |135|
ADD .L1 A5,A3,A3 ; |135|
SHR .S1 A3,1,A5 ; |136|
|| [!A0] B .S2 $C$L7 ; |149|
ADD .L2X A3,B5,B6 ; |135|
ADD .L2X B5,A5,B5 ; |136|
|| ADD .L1 3,A30,A5 ; |152|
ADD .L2X A3,B5,B7 ; |136|
|| ADD .S2 B26,B4,B5 ; |134|
|| MV .L1X B8,A29
ADD .L2 B4,B7,B6 ; |136|
|| ADD .D2 B4,B6,B4 ; |135|
|| SHR .S2X A5,2,B1 ; |152|
SUB .L1X B5,4,A21 ; |134|
|| SHR .S2X A29,1,B27
|| SUB .L2 B6,4,B24 ; |136|
|| SUB .D2 B4,4,B23 ; |135|
; BRANCHCC OCCURS {$C$L7} ; |149|
;** --*
; EXCLUSIVE CPU CYCLES: 5
ADD .S2 7,B26,B4
|| ADD .L1 A30,A28,A9 ; |153|
|| CMPGT .L2 B26,0,B0 ; |174|
|| MV .S1 A28,A8 ; |152|
SHR .S2 B4,2,B4
|| ADD .L1 A30,A9,A6 ; |154|
SHRU .S2 B4,29,B4
|| ADD .L1 A30,A6,A16 ; |155|
|| MV .S1X B0,A0 ; guard predicate rewrite
ADD .L2 B26,B4,B4
|| ADD .L1 A30,A16,A5 ; |156|
|| MV .S1X B0,A1 ; |174| branch predicate copy
.dwpsn file "YUVRotate.c",line 149,column 0,is_stmt
ADD .L2 7,B4,B4
|| ADD .L1 A30,A5,A3 ; |157|
;** --*
;** BEGIN LOOP $C$L1
;** --*
$C$L1:
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$3$B:
.dwpsn file "YUVRotate.c",line 150,column 0,is_stmt
; EXCLUSIVE CPU CYCLES: 6
;** -----------------------g3:
;** 152 ----------------------- pSrcRow0 = U$34;
;** 153 ----------------------- pSrcRow1 = &pSrcRow0[nYuvWidth];
;** 154 ----------------------- pSrcRow2 = &pSrcRow1[nYuvWidth];
;** 155 ----------------------- pSrcRow3 = &pSrcRow2[nYuvWidth];
;** 156 ----------------------- pSrcRow4 = &pSrcRow3[nYuvWidth];
;** 157 ----------------------- pSrcRow5 = &pSrcRow4[nYuvWidth];
;** 158 ----------------------- pSrcRow6 = &pSrcRow5[nYuvWidth];
;** 159 ----------------------- pSrcRow7 = &pSrcRow6[nYuvWidth];
;** 174 ----------------------- if ( nHeight <= 0 ) goto g9;
[!B0] B .S1 $C$L6 ; |174|
|| ADD .L1 A30,A3,A4 ; |158|
|| [!A1] SUB .L2 B1,1,B1 ; |149|
|| AND .S2 8,B4,B2
CMPGT .L2 B26,8,B0 ; |176|
|| ADD .L1 A30,A4,A7 ; |159|
[!A0] MVK .L2 0x1,B0 ; |176| nullify predicate
[!B0] BNOP .S1 $C$L5,2 ; |176|
; BRANCHCC OCCURS {$C$L6} ; |174|
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$3$E:
;** --*
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$4$B:
; EXCLUSIVE CPU CYCLES: 3
;** ----------------------- C$3 = (nHeight+7>>2>>29)+nHeight+7;
;** ----------------------- K$174 = C$3&0x8;
;** 176 ----------------------- if ( nHeight <= 8 ) goto g7;
[ B0] SHRU .S1X B26,1,A17
[ B0] MV .L2X A5,B5
[ B0] ADD .L1X A21,B26,A5
; BRANCHCC OCCURS {$C$L5} ; |176|
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$4$E:
;** --*
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$5$B:
; EXCLUSIVE CPU CYCLES: 5
;** ----------------------- U$102 = nHeight+pY0;
;** ----------------------- C$4 = (unsigned)nHeight>>1;
;** ----------------------- U$116 = &pU0[C$4];
;** ----------------------- U$127 = &pV0[C$4];
;** ----------------------- U$131 = nYuvWidth<<3;
;** ----------------------- L$2 = C$3>>4;
;** ----------------------- #pragma MUST_ITERATE(1, 134217727, 1)
;** ----------------------- #pragma UNROLL(1)
;** ----------------------- // LOOP BELOW UNROLLED BY FACTOR(2)
;** ----------------------- #pragma LOOP_FLAGS(4103u)
;** -----------------------g6:
;** 174 ----------------------- nPixel0 = _mem4((void *)pSrcRow0);
;** 177 ----------------------- nPixel1 = _mem4((void *)pSrcRow1);
;** 179 ----------------------- nTemp0 = _pack2(nPixel0, nPixel1);
;** 180 ----------------------- nPixel2 = _mem4((void *)pSrcRow2);
;** 181 ----------------------- nPixel3 = _mem4((void *)pSrcRow3);
;** 182 ----------------------- nTemp1 = _pack2(nPixel2, nPixel3);
;** 187 ----------------------- nPixel4 = _mem4((void *)pSrcRow4);
;** 188 ----------------------- nPixel5 = _mem4((void *)pSrcRow5);
;** 189 ----------------------- nTemp0 = _pack2(nPixel4, nPixel5);
;** 190 ----------------------- nPixel6 = _mem4((void *)pSrcRow6);
;** 191 ----------------------- nPixel7 = _mem4((void *)pSrcRow7);
;** 192 ----------------------- nTemp1 = _pack2(nPixel6, nPixel7);
;** 193 ----------------------- _memd8((void *)(pY0-4)) = __pack12(_packl4(nTemp0, nTemp1), _packl4(nTemp0, nTemp1));
;** 194 ----------------------- _memd8((void *)(U$102-4)) = __pack12(_packh4(nTemp0, nTemp1), _packh4(nTemp0, nTemp1));
;** 198 ----------------------- nTemp0 = _packh2(nPixel0, nPixel2);
;** 199 ----------------------- nTemp1 = _packh2(nPixel4, nPixel6);
;** 200 ----------------------- _mem4((void *)pU0) = _packl4(nTemp0, nTemp1);
;** 201 ----------------------- _mem4((void *)U$116) = _packh4(nTemp0, nTemp1);
;** 203 ----------------------- nTemp0 = _packh2(nPixel1, nPixel3);
;** 204 ----------------------- nTemp1 = _packh2(nPixel5, nPixel7);
;** 205 ----------------------- _mem4((void *)pV0) = _packl4(nTemp0, nTemp1);
;** 206 ----------------------- _mem4((void *)U$127) = _packh4(nTemp0, nTemp1);
;** 209 ----------------------- pSrcRow0 += U$131;
;** 210 ----------------------- pSrcRow1 += U$131;
;** 211 ----------------------- pSrcRow2 += U$131;
;** 212 ----------------------- pSrcRow3 += U$131;
;** 213 ----------------------- pSrcRow5 += U$131;
;** 214 ----------------------- pSrcRow6 += U$131;
;** 215 ----------------------- pSrcRow7 += U$131;
;** 216 ----------------------- pSrcRow4 += U$131;
;** 174 ----------------------- nPixel0 = _mem4((void *)pSrcRow0);
;** 177 ----------------------- nPixel1 = _mem4((void *)pSrcRow1);
;** 179 ----------------------- nTemp0 = _pack2(nPixel0, nPixel1);
;** 180 ----------------------- nPixel2 = _mem4((void *)pSrcRow2);
;** 181 ----------------------- nPixel3 = _mem4((void *)pSrcRow3);
;** 182 ----------------------- nTemp1 = _pack2(nPixel2, nPixel3);
;** 184 ----------------------- _mem4((void *)((unsigned *)U$102-8)) = _packh4(nTemp0, nTemp1);
;** 187 ----------------------- nPixel4 = _mem4((void *)pSrcRow4);
;** 188 ----------------------- nPixel5 = _mem4((void *)pSrcRow5);
;** 189 ----------------------- nTemp0 = _pack2(nPixel4, nPixel5);
;** 190 ----------------------- nPixel6 = _mem4((void *)pSrcRow6);
;** 191 ----------------------- nPixel7 = _mem4((void *)pSrcRow7);
;** 192 ----------------------- nTemp1 = _pack2(nPixel6, nPixel7);
;** 193 ----------------------- _memd8((void *)(pY0-12)) = __pack12(_packl4(nTemp0, nTemp1), _packl4(nTemp0, nTemp1));
;** 194 ----------------------- _mem4((void *)((unsigned *)U$102-12)) = _packh4(nTemp0, nTemp1);
;** 198 ----------------------- nTemp0 = _packh2(nPixel0, nPixel2);
;** 199 ----------------------- nTemp1 = _packh2(nPixel4, nPixel6);
;** 200 ----------------------- _mem4((void *)(pU0-4)) = _packl4(nTemp0, nTemp1);
;** 201 ----------------------- _mem4((void *)((unsigned *)U$116-4)) = _packh4(nTemp0, nTemp1);
;** 203 ----------------------- nTemp0 = _packh2(nPixel1, nPixel3);
;** 204 ----------------------- nTemp1 = _packh2(nPixel5, nPixel7);
;** 205 ----------------------- _mem4((void *)(pV0-4)) = _packl4(nTemp0, nTemp1);
;** 206 ----------------------- _mem4((void *)((unsigned *)U$127-4)) = _packh4(nTemp0, nTemp1);
;** 209 ----------------------- pSrcRow0 += U$131;
;** 210 ----------------------- pSrcRow1 += U$131;
;** 211 ----------------------- pSrcRow2 += U$131;
;** 212 ----------------------- pSrcRow3 += U$131;
;** 213 ----------------------- pSrcRow5 += U$131;
;** 214 ----------------------- pSrcRow6 += U$131;
;** 215 ----------------------- pSrcRow7 += U$131;
;** 216 ----------------------- pSrcRow4 += U$131;
;** 217 ----------------------- U$102 -= 16;
;** 217 ----------------------- pY0 -= 16;
;** 218 ----------------------- U$116 -= 8;
;** 218 ----------------------- pU0 -= 8;
;** 219 ----------------------- U$127 -= 8;
;** 219 ----------------------- pV0 -= 8;
;** 174 ----------------------- if ( !__builtin_expect((long)!(L$2 = L$2-1), 0L) ) goto g6;
SHR .S2 B4,4,B0
|| MV .L2X A4,B4
|| SHL .S1 A30,3,A22
|| SUB .L1 A5,4,A25
|| ADD .D1X B23,A17,A24
MV .L2X A3,B6
|| ADD .L1X B24,A17,A23
MV .L2X A8,B8
MV .L2X A9,B18
SUB .L2X A21,4,B22
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$5$E:
;*----*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 174
;* Loop opening brace source line : 175
;* Loop closing brace source line : 220
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 5
;* Unpartitioned Resource Bound : 15
;* Partitioned Resource Bound(*) : 29
;* Resource Partition:
;* A-side B-side
;* .L units 8 8
;* .S units 0 1
;* .D units 15 14
;* .M units 0 0
;* .X cross paths 4 14
;* .T address paths 29* 29*
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 8 8 (.L or .S unit)
;* Addition ops (.LSD) 10 9 (.L or .S or .D unit)
;* Bound(.L .S .LS) 8 9
;* Bound(.L .S .D .LS .LSD) 14 14
;*
;* Searching for software pipeline schedule at ...
;* ii = 29 Schedule found with 1 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: | ** ** * **** |* *** * * *** |
;* 1: | ** ** * **** |* *** * * *** |
;* 2: | ** ** * **** |* *** * ** *** |
;* 3: | ** ** * **** |* ***** ** *** |
;* 4: | ** ** * **** |* ****** ** *** |
;* 5: | ***** * **** |* ****** ** *** |
;* 6: | ***** * **** |* ***** *** *** |
;* 7: | ****** * **** |* ***** *** *** |
;* 8: | ******* * **** |* ***** **** *** |
;* 9: | ******* ** **** |* ***** ** * **** |
;* 10: | ****** ******* * |* * **** ** * **** |
;* 11: | ****** * ******* * |* * **** ** * ***** |
;* 12: | ****** ** ******* * |* * **** ** * ***** |
;* 13: | ****** ** ******* * |* ****** * ** ***** |
;* 14: | ****** ** ******* * |* ****** *** ***** |
;* 15: | ****** ********** * |* ****** ********** |
;* 16: | ****** ********** * |* ****** ********** |
;* 17: | ****** ************ |* ****** ********** |
;* 18: | ****** * ********** |* ***** ** ******* |
;* 19: | * **** * ********** |* ***** * ******** |
;* 20: | *** ** ************ |* ***** * ** **** |
;* 21: | *** *** ************ |* ***** **** |
;* 22: | *** *** ************ |* ***** **** |
;* 23: | ******* ********** |* ** * **** |
;* 24: | **** * ********** |* ** * **** |
;* 25: | **** * ********** |* ** * **** |
;* 26: | ***** ********* |* * * **** |
;* 27: | ***** ******* |* * * * *** |
;* 28: | **** ****** |* ** * * *** |
;* +-----------------------------+
;*
;* Done
;*
;* Loop is interruptible
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;*
;* Minimum safe trip count : 1 (after unrolling)
;*
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;* Mem bank perf. penalty (est.) : 0.0%
;*
;*
;* Total cycles (est.) : 0 + trip_cnt * 29
;*----*
;* SETUP CODE
;*
;* SUB A25,4,A25
;* SUB B22,4,B22
;* MV A4,B4
;* MV A3,B6
;*
;* SINGLE SCHEDULED ITERATION
;*
;* $C$C42:
;* 0 LDNW .D2T2 *B6,B6 ; |188| ^
;* 1 LDNW .D2T2 *B18,B17 ; |177| ^
;* 2 LDNW .D2T2 *B8,B7 ; |174| ^
;* 3 LDNW .D2T2 *B5,B9 ; |187| ^
;* 4 LDNW .D1T1 *A7,A5 ; |191| ^
;* 5 LDNW .D2T2 *B4,B16 ; |190| ^
;* 6 LDNW .D1T1 *A6,A8 ; |180| ^
;* 7 PACK2 .L2 B7,B17,B19 ; |179| ^
;* || LDNW .D1T1 *A16,A9 ; |181| ^
;* || ADD .S2X A22,B8,B8 ; |209|
;* 8 ADD .L2X A22,B18,B25 ; |210|
;* || ADD .S1 A22,A6,A19 ; |211|
;* || ADD .L1 A22,A16,A20 ; |212|
;* || LDNW .D2T1 *B8,A6 ; |174| ^
;* 9 ADD .L1 A22,A3,A27 ; |213|
;* || ADD .S1 A22,A7,A21 ; |215|
;* || ADD .L2X A22,B5,B4 ; |216|
;* || LDNW .D2T1 *B25,A7 ; |177| ^
;* 10 PACK2 .L2X B16,A5,B21 ; |192| ^
;* || LDNW .D1T1 *A27,A16 ; |188| ^
;* 11 PACKH2 .S2X B7,A8,B7 ; |198| ^
;* || PACKH2 .L2 B9,B16,B16 ; |199| ^
;* || ADD .L1 A22,A4,A4 ; |214|
;* || LDNW .D2T1 *B4,A17 ; |187| ^
;* 12 PACK2 .L1 A8,A9,A3 ; |182| ^
;* || PACKH4 .L2 B7,B16,B18 ; |201|
;* || PACKH2 .S2X B17,A9,B5 ; |203| ^
;* || LDNW .D1T1 *A4,A8 ; |190| ^
;* 13 PACKL4 .L2 B7,B16,B7 ; |200|
;* || PACKH2 .S2X B6,A5,B17 ; |204| ^
;* || LDNW .D1T1 *A21,A5 ; |191| ^
;* 14 PACKH4 .L2 B5,B17,B20 ; |206|
;* || PACK2 .L1 A6,A7,A18 ; |179| ^
;* || LDNW .D1T2 *A20,B16 ; |181| ^
;* 15 PACKL4 .L2 B5,B17,B17 ; |205|
;* || LDNW .D1T2 *A19,B5 ; |180| ^
;* 16 PACKH4 .L2X B19,A3,B7 ; |194|
;* || STNW .D2T2 B7,*B23--(8) ; |200|
;* || PACK2 .L1 A17,A16,A26 ; |189| ^
;* 17 PACK2 .S2 B9,B6,B6 ; |189| ^
;* || PACKL4 .L2X B19,A3,B19 ; |193|
;* || STNW .D1T2 B18,*A24--(8) ; |201|
;* || PACKH2 .L1 A17,A8,A9 ; |199| ^
;* 18 PACKL4 .L2 B6,B21,B18 ; |193|
;* || STNW .D2T2 B17,*B24--(8) ; |205|
;* || PACK2 .L1 A8,A5,A8 ; |192| ^
;* || PACKH2 .S1 A16,A5,A16 ; |204| ^
;* 19 PACKH4 .L2 B6,B21,B6 ; |194|
;* || STNW .D1T2 B20,*A23--(8) ; |206|
;* || PACKH4 .L1 A26,A8,A5 ; |194|
;* || PACKH2 .S1X A7,B16,A17 ; |203| ^
;* 20 STNDW .D2T2 B19:B18,*B22--(16) ; |193| ^
;* || PACK2 .L2 B5,B16,B5 ; |182| ^
;* || PACKH2 .S1X A6,B5,A7 ; |198| ^
;* || PACKL4 .L1 A17,A16,A3 ; |205|
;* 21 PACKH4 .L1 A7,A9,A3 ; |201|
;* || STNW .D2T1 A3,*+B24(4) ; |205|
;* 22 STNDW .D1T2 B7:B6,*A25--(16) ; |194| ^
;* || PACKH4 .L1 A17,A16,A6 ; |206|
;* || [ B0] SUB .L2 B0,1,B0 ; |174|
;* 23 PACKL4 .L1 A7,A9,A6 ; |200|
;* || STNW .D1T1 A6,*+A23(4) ; |206|
;* || [ B0] B .S2 $C$C42 ; |174|
;* 24 PACKH4 .L1X A18,B5,A6 ; |184|
;* || STNW .D2T1 A6,*+B23(4) ; |200|
;* || ADD .L2X A22,B8,B8 ; |209|
;* 25 PACKL4 .L1X A18,B5,A7 ; |193|
;* || STNW .D1T1 A3,*+A24(4) ; |201|
;* || ADD .S1 A22,A4,A4 ; |214|
;* || ADD .L2X A22,B4,B5 ; |216|
;* 26 STNW .D1T1 A6,*+A25(12) ; |184| ^
;* || PACKL4 .L1 A26,A8,A6 ; |193|
;* || ADD .L2X A22,B25,B18 ; |210|
;* || ADD .S1 A22,A27,A3 ; |213|
;* 27 STNDW .D2T1 A7:A6,*+B22(8) ; |193| ^
;* || ADD .L1 A22,A19,A6 ; |211|
;* || MV .L2X A4,B4 ; |214| Define a twin register
;* 28 STNW .D1T1 A5,*+A25(8) ; |194| ^
;* || ADD .L1 A22,A20,A16 ; |212|
;* || MV .L2X A3,B6 ; |213| Define a twin register
;* || ADD .S1 A22,A21,A7 ; |215|
;* 29 ; BRANCHCC OCCURS {$C$C42} ; |174|
;*
;* RESTORE CODE
;*
;* ADD 4,B22,B22
;*----*
$C$L2: ; PIPED LOOP PROLOG
;** --*
$C$L3: ; PIPED LOOP KERNEL
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$7$B:
.dwpsn file "YUVRotate.c",line 175,column 0,is_stmt
; EXCLUSIVE CPU CYCLES: 29
LDNW .D2T2 *B6,B6 ; |188| <0,0> ^
LDNW .D2T2 *B18,B17 ; |177| <0,1> ^
LDNW .D2T2 *B8,B7 ; |174| <0,2> ^
LDNW .D2T2 *B5,B9 ; |187| <0,3> ^
LDNW .D1T1 *A7,A5 ; |191| <0,4> ^
LDNW .D2T2 *B4,B16 ; |190| <0,5> ^
LDNW .D1T1 *A6,A8 ; |180| <0,6> ^
ADD .S2X A22,B8,B8 ; |209| <0,7>
|| PACK2 .L2 B7,B17,B19 ; |179| <0,7> ^
|| LDNW .D1T1 *A16,A9 ; |181| <0,7> ^
ADD .L2X A22,B18,B25 ; |210| <0,8>
|| LDNW .D2T1 *B8,A6 ; |174| <0,8> ^
|| ADD .L1 A22,A16,A20 ; |212| <0,8>
|| ADD .S1 A22,A6,A19 ; |211| <0,8>
ADD .L1 A22,A3,A27 ; |213| <0,9>
|| ADD .L2X A22,B5,B4 ; |216| <0,9>
|| ADD .S1 A22,A7,A21 ; |215| <0,9>
|| LDNW .D2T1 *B25,A7 ; |177| <0,9> ^
LDNW .D1T1 *A27,A16 ; |188| <0,10> ^
|| PACK2 .L2X B16,A5,B21 ; |192| <0,10> ^
ADD .L1 A22,A4,A4 ; |214| <0,11>
|| LDNW .D2T1 *B4,A17 ; |187| <0,11> ^
|| PACKH2 .L2 B9,B16,B16 ; |199| <0,11> ^
|| PACKH2 .S2X B7,A8,B7 ; |198| <0,11> ^
LDNW .D1T1 *A4,A8 ; |190| <0,12> ^
|| PACKH2 .S2X B17,A9,B5 ; |203| <0,12> ^
|| PACKH4 .L2 B7,B16,B18 ; |201| <0,12>
|| PACK2 .L1 A8,A9,A3 ; |182| <0,12> ^
LDNW .D1T1 *A21,A5 ; |191| <0,13> ^
|| PACKH2 .S2X B6,A5,B17 ; |204| <0,13> ^
|| PACKL4 .L2 B7,B16,B7 ; |200| <0,13>
LDNW .D1T2 *A20,B16 ; |181| <0,14> ^
|| PACK2 .L1 A6,A7,A18 ; |179| <0,14> ^
|| PACKH4 .L2 B5,B17,B20 ; |206| <0,14>
LDNW .D1T2 *A19,B5 ; |180| <0,15> ^
|| PACKL4 .L2 B5,B17,B17 ; |205| <0,15>
STNW .D2T2 B7,*B23--(8) ; |200| <0,16>
|| PACK2 .L1 A17,A16,A26 ; |189| <0,16> ^
|| PACKH4 .L2X B19,A3,B7 ; |194| <0,16>
STNW .D1T2 B18,*A24--(8) ; |201| <0,17>
|| PACKH2 .L1 A17,A8,A9 ; |199| <0,17> ^
|| PACK2 .S2 B9,B6,B6 ; |189| <0,17> ^
|| PACKL4 .L2X B19,A3,B19 ; |193| <0,17>
PACK2 .L1 A8,A5,A8 ; |192| <0,18> ^
|| PACKH2 .S1 A16,A5,A16 ; |204| <0,18> ^
|| PACKL4 .L2 B6,B21,B18 ; |193| <0,18>
|| STNW .D2T2 B17,*B24--(8) ; |205| <0,18>
PACKH4 .L1 A26,A8,A5 ; |194| <0,19>
|| PACKH2 .S1X A7,B16,A17 ; |203| <0,19> ^
|| PACKH4 .L2 B6,B21,B6 ; |194| <0,19>
|| STNW .D1T2 B20,*A23--(8) ; |206| <0,19>
STNDW .D2T2 B19:B18,*B22--(16) ; |193| <0,20> ^
|| PACK2 .L2 B5,B16,B5 ; |182| <0,20> ^
|| PACKH2 .S1X A6,B5,A7 ; |198| <0,20> ^
|| PACKL4 .L1 A17,A16,A3 ; |205| <0,20>
PACKH4 .L1 A7,A9,A3 ; |201| <0,21>
|| STNW .D2T1 A3,*+B24(4) ; |205| <0,21>
[ B0] SUB .L2 B0,1,B0 ; |174| <0,22>
|| STNDW .D1T2 B7:B6,*A25--(16) ; |194| <0,22> ^
|| PACKH4 .L1 A17,A16,A6 ; |206| <0,22>
STNW .D1T1 A6,*+A23(4) ; |206| <0,23>
|| PACKL4 .L1 A7,A9,A6 ; |200| <0,23>
|| [ B0] B .S2 $C$L3 ; |174| <0,23>
ADD .L2X A22,B8,B8 ; |209| <0,24>
|| STNW .D2T1 A6,*+B23(4) ; |200| <0,24>
|| PACKH4 .L1X A18,B5,A6 ; |184| <0,24>
ADD .L2X A22,B4,B5 ; |216| <0,25>
|| ADD .S1 A22,A4,A4 ; |214| <0,25>
|| STNW .D1T1 A3,*+A24(4) ; |201| <0,25>
|| PACKL4 .L1X A18,B5,A7 ; |193| <0,25>
ADD .L2X A22,B25,B18 ; |210| <0,26>
|| ADD .S1 A22,A27,A3 ; |213| <0,26>
|| PACKL4 .L1 A26,A8,A6 ; |193| <0,26>
|| STNW .D1T1 A6,*+A25(12) ; |184| <0,26> ^
STNDW .D2T1 A7:A6,*+B22(8) ; |193| <0,27> ^
|| MV .L2X A4,B4 ; |214| <0,27> Define a twin register
|| ADD .L1 A22,A19,A6 ; |211| <0,27>
.dwpsn file "YUVRotate.c",line 220,column 0,is_stmt
ADD .L1 A22,A20,A16 ; |212| <0,28>
|| ADD .S1 A22,A21,A7 ; |215| <0,28>
|| MV .L2X A3,B6 ; |213| <0,28> Define a twin register
|| STNW .D1T1 A5,*+A25(8) ; |194| <0,28> ^
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$7$E:
;** --*
$C$L4: ; PIPED LOOP EPILOG
;** --*
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$9$B:
; EXCLUSIVE CPU CYCLES: 4
MV .L1X B18,A9
MV .L1X B5,A5
MV .L1X B8,A8
ADD .L1X 4,B22,A21
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$9$E:
;** --*
$C$L5:
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$10$B:
; EXCLUSIVE CPU CYCLES: 6
;** -----------------------g7:
;** ----------------------- if ( !K$174 ) goto g9;
[!B2] B .S1 $C$L6
|| [ B2] LDNW .D1T1 *A6,A17 ; |180|
|| [!B2] SUB .L2 B1,1,B1 ; |149|
[ B2] LDNW .D1T1 *A16,A6 ; |181|
[ B2] LDNW .D1T1 *A5,A19 ; |187|
[ B2] LDNW .D1T1 *A4,A20 ; |190|
[ B2] LDNW .D1T1 *A9,A9 ; |177|
[ B2] LDNW .D1T1 *A8,A8 ; |174|
; BRANCHCC OCCURS {$C$L6}
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$10$E:
;** --*
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$11$B:
; EXCLUSIVE CPU CYCLES: 14
; Peeled loop iterations for unrolled loop:
;** 174 ----------------------- nPixel0 = _mem4((void *)pSrcRow0);
;** 177 ----------------------- nPixel1 = _mem4((void *)pSrcRow1);
;** 179 ----------------------- nTemp0 = _pack2(nPixel0, nPixel1);
;** 180 ----------------------- nPixel2 = _mem4((void *)pSrcRow2);
;** 181 ----------------------- nPixel3 = _mem4((void *)pSrcRow3);
;** 182 ----------------------- nTemp1 = _pack2(nPixel2, nPixel3);
;** 183 ----------------------- _mem4((void *)pY0) = _packl4(nTemp0, nTemp1);
;** 184 ----------------------- _mem4((void *)(nHeight+pY0)) = _packh4(nTemp0, nTemp1);
;** 187 ----------------------- nPixel4 = _mem4((void *)pSrcRow4);
;** 188 ----------------------- nPixel5 = _mem4((void *)pSrcRow5);
;** 189 ----------------------- nTemp0 = _pack2(nPixel4, nPixel5);
;** 190 ----------------------- nPixel6 = _mem4((void *)pSrcRow6);
;** 191 ----------------------- nPixel7 = _mem4((void *)pSrcRow7);
;** 192 ----------------------- nTemp1 = _pack2(nPixel6, nPixel7);
;** 193 ----------------------- _mem4((void *)(pY0-4)) = _packl4(nTemp0, nTemp1);
;** 194 ----------------------- _mem4((void *)(nHeight+pY0-4)) = _packh4(nTemp0, nTemp1);
;** 198 ----------------------- nTemp0 = _packh2(nPixel0, nPixel2);
;** 199 ----------------------- nTemp1 = _packh2(nPixel4, nPixel6);
;** 200 ----------------------- _mem4((void *)pU0) = _packl4(nTemp0, nTemp1);
;** 201 ----------------------- C$2 = (unsigned)nHeight>>1;
;** 201 ----------------------- _mem4((void *)(C$2+pU0)) = _packh4(nTemp0, nTemp1);
;** 203 ----------------------- nTemp0 = _packh2(nPixel1, nPixel3);
;** 204 ----------------------- nTemp1 = _packh2(nPixel5, nPixel7);
;** 205 ----------------------- _mem4((void *)pV0) = _packl4(nTemp0, nTemp1);
;** 206 ----------------------- _mem4((void *)(C$2+pV0)) = _packh4(nTemp0, nTemp1);
;** 217 ----------------------- pY0 -= 8;
;** 218 ----------------------- pU0 -= 4;
;** 219 ----------------------- pV0 -= 4;
LDNW .D1T1 *A3,A16 ; |188|
|| SHRU .S2 B26,1,B4 ; |201|
|| ADD .L2X A21,B26,B5 ; |184|
|| ADD .L1X A21,B26,A23 ; |194|
|| SUB .D2 B1,1,B1 ; |149|
LDNW .D1T1 *A7,A18 ; |191|
|| PACK2 .L1 A17,A6,A7 ; |182|
|| ADD .L2 B23,B4,B30 ; |201|
|| ADD .S2 B24,B4,B31 ; |206|
PACKH2 .L1 A19,A20,A5 ; |199|
PACKH2 .L1 A9,A6,A4 ; |203|
PACKH2 .L1 A8,A17,A31 ; |198|
|| PACK2 .S1 A8,A9,A22 ; |179|
PACKL4 .L1 A31,A5,A8 ; |200|
|| PACK2 .S1 A19,A16,A24 ; |189|
PACKL4 .L1 A22,A7,A6 ; |183|
|| STNW .D2T1 A8,*B23 ; |200|
|| PACK2 .S1 A20,A18,A25 ; |192|
|| SUB .L2 B23,4,B23 ; |218|
STNW .D1T1 A6,*A21 ; |183|
|| PACKH4 .L1 A31,A5,A5 ; |201|
|| PACKH2 .S1 A16,A18,A3 ; |204|
PACKH4 .L1 A22,A7,A7 ; |184|
|| STNW .D2T1 A5,*B30 ; |201|
STNW .D2T1 A7,*B5 ; |184|
|| PACKL4 .L1 A24,A25,A5 ; |193|
PACKL4 .L1 A4,A3,A27 ; |205|
|| STNW .D1T1 A5,*-A21(4) ; |193|
|| SUB .S1 A21,8,A21 ; |217|
STNW .D2T1 A27,*B24 ; |205|
|| PACKH4 .L1 A4,A3,A26 ; |206|
|| SUB .L2 B24,4,B24 ; |219|
PACKH4 .L1 A24,A25,A4 ; |194|
|| STNW .D2T1 A26,*B31 ; |206|
STNW .D1T1 A4,*-A23(4) ; |194|
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$11$E:
;** --*
$C$L6:
$C$DW$L$_VixEye_format_yyuuyyvv2planaryuvRoate90$12$B:
; EXCLUSIVE CPU CYCLES: 6
;** -----------------------g9:
;** 149 ----------------------- pY0 += K$185;
;** 224 ----------------------- pU0 += K$187;
;** 225 ----------------------- pV0 += K$187;
;** 149 ----------------------- U$34 += 4;
;** 149 ----------------------- if ( L$1 = L$1-1 ) goto g3;
ADD .L1 4,A28,A28 ; |149|
|| [ B1] B .S1 $C$L1 ; |149|
|| [ B1] ADD .S2 7,B26,B4
|| [ B1] CMPGT .L2 B26,0,B0 ; |174|
|| ADD .D1 A29,A21,A21 ; |149|
|| ADD .D2 B27,B24,B24 ; |225|
[ B1] ADD .L1 A30,A28,A9 ; |153|
|| [ B1] SHR .S2 B4,2,B4
|| [ B1] MV .S1 A28,A8 ; |152|
|| ADD .L2 B27,B23,B23 ; |224|
[ B1] ADD .L1 A30,A9,A6 ; |154|
|| [ B1] SHRU .S2 B4,29,B4
|| [ B1] MV .S1X B0,A0 ; guard predicate rewrite
[ B1] ADD .L1 A30,A6,A16 ; |155|
|| [ B1] ADD .L2 B26,B4,B4
|| [ B1] MV .S1X B0,A1 ; |174| branch predicate copy
[ B1] ADD .L1 A30,A16,A5 ; |156|
|| [ B1] ADD .L2 7,B4,B4
.dwpsn file "YUVRotate.c",line 226,column 0,is_stmt

2010-12-03

shengkai.sun

_____________________________________
Sun,

I think the use of _nassert() needs a re-read.

The _nassert() calls only need to appear once, at the beginning of the
function, to assert the alignment of the passed-in pointer parameters and the
sizing of the width and height parameters.

Any and all other _nassert() calls are unneeded and just waste CPU cycles.

More than likely, the compiler will not have enough registers available for
the number of 'register' modifiers listed in this function. Therefore, *I*
would only place the 'register' modifier on the variables that are most
heavily used.
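
Roughly what I mean is this (a sketch only; the loop body is elided, and the multiples of 8 are whatever your data really guarantees):

int VixEye_format_yyuuyyvv2planaryuvRoate90(char *restrict pInBuf,
                                            char *restrict pOutBuf,
                                            const int nWidth, const int nHeight,
                                            int nRightLeft)
{
    /* one-time promises about the passed-in parameters -- nothing inside the loops */
    _nassert((int)pInBuf % 8 == 0);
    _nassert((int)pOutBuf % 8 == 0);
    _nassert(nWidth % 8 == 0);
    _nassert(nHeight % 8 == 0);

    /* ... the two loops exactly as you already have them, but without the
       ALIGNED_ARRAY4() calls inside the outer loop and with 'register' kept
       only on the most heavily used pointers ... */

    return 0;
}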

R. Williams


_____________________________________