Sign in

Not a member? | Forgot your Password?

Search code

Search tips


Free PDF Downloads

A Quadrature Signals Tutorial: Complex, But Not Complicated

Understanding the 'Phasing Method' of Single Sideband Demodulation

Complex Digital Signal Processing in Telecommunications

Introduction to Sound Processing

C++ Tutorial

IIR Filter Design Software

See Also

Embedded SystemsFPGA

DSP Code Sharing > Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication

Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication

Language: ASM

Processor: TI C64x

Submitted by Shuhua Zhang on Aug 11 2011

Licensed under a Creative Commons Attribution 3.0 Unported License

Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication


 

In some adaptive filters, say, normalized least mean squares adaptive filter  (NLMS) and multidelay block frequency domain adaptive filter (MDF), update to the filter coefficients is normalized for stability and faster convergence. This kind of normalization needs a computation that all fixed-point DSPs hate very much — division.    

So there is a routine called 'DSP_recip16' to find reciprocals in TI's C64x and C64x+ DSP libraries ('dsp64x.lib' and 'dsp64x+.lib', respectively). Unfortunately, 'DSP_recip16' uses an inefficient algorithm: dividing by a loop of conditional subtracts, costing 8 cycles per input number (in the format of Q.15).

We have a better way — the time-honored Newton-Raphson method. Specificialy, given the fuction f(U) = 1/U - V, then
    Un+1 = Un - f(Un)/f'(Un) = Un(2-VUn) → 1/V
quadratically, i.e., error shrinking squarely, or effective digits doubling each step. 

The following is a hand tuned linear assembly code of finding reciprocals of a vector of Q.15 numbers based on the above method. 

To compute
    y[i] = ym[i] * 2^ye[i] = 1 / x[i], for i = 0,...,N-1, and -1 ≤ x[i] < 1,
where ym is the mantissa (significand) vector and ye is the exponent vector, call
    DSP_vrecip16n(x, ym, ye, N).
This interface is identical to that of TI's 'DSP_recip16'.

Under CCS5.1 (Linux) and with compiler option -O2 (software pipelining turned on), 'DSP_vrecip16n' takes 55+2.5*N to compute N reciprocals. It must be noted that N here is supposed to be a multiple of 4 (loading 4 short numbers a time) and greater than or equal to 24 (software pipelining).  If N is not a multiple of 4, 'DSP_vrecip16n' effectively operate on 4*floor(N/4) inputs. If N is less than 24, the results might be wrong since software pipelining is broken in this case.

The maximum relative error is 2-16. The storage type of the input x[i] and the mantissa part of the output ym[i] is 16-bit short, but arithmetically, they are interpreted as Q.15 fractional numbers, equal to x[i]/215 and ym[i]/215, respectively. However, the exponent part of the output ye[i] is an integer. All these conventions are the same as TI's 'DSP_recip16'.

The specific algorithm used here is 

  •     normalize abs(x[i]) to [.5, 1) and denote it by V;
  •     initialization: U0 = (V - .75)2 + 1.4256 - V, quadratic fitting of 1/(2V) with 5.9233 effective bits;
  •     iteration 1: U1 = U0(1 - VU0), ~ 1/(4V) with 11.481 effective bits;
  •     iteration 2: U2 = 8U1(.5 - VU1), ~ 1/(2V) with 22.585 effective bits;
  •     output 1/x[i] in the format Fl16, or Q.15 mantissa | Int16 exponent.

With the quadratic convergence of the Newton-Raphson method and a fine-tuned initial value, in just 2 iterations, we have 22.585 effective bits — more than could be stored in Fl16. 

Now, C64x and C64x+ could hate division less:) 

 
* =========================================================================== *
*                                                                             *
*  Compute reciprocals of a Q.15 vector by Newton-Raphson method              *
*      y[i] = ym[i] * 2^ye[i] = 1 / x[i], -1 <= x[i] < 1 and x[i] != 0        *
*      where ym is the mantissa vector and ye is the exponent vector          *
*                                                                             *
*  C prototype:                                                               *
*       void DSP_vrecip16n(short* x,    // Input, Q.15 vector                 *
*                          short* ym,   // Output, Q.15 vector                *
*                          short* ye,   // Output, int16 vector               *
*                          int    N);   // Input, vector length               *
*                                                                             *
*  Restriction:                                                               *
*       (N % 4) == 0 and N >= 24                                              *
*                                                                             *
*  Performance:                                                               *
*       55+2.5*N (software pipelining enabled by -O2, CCS5.1)                 *
*                                                                             *
*  Relative error:                                                            *
*       (-2^-16, 2^-16)                                                       *
*                                                                             *
*  Algorithm:                                                                 *
*       Input: V, abs(x[i]) normalized to [.5, 1)                             *
*       Initialization: U0 = (V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits      *
*       Iteration 1:    U1 = U0*(1-V*U0),          ~1/(4*V), 11.481 bits      *
*       Iteration 2:    U2 = 8*U1*(.5-V*U1),       ~1/(2*V), 22.585 bits      *
*       Output in the format of Fl16 = Q.15(mant)|Int16(expt)                 *
*                                                                             *
* =========================================================================== *
   
        .sect ".text: _DSP_vrecip16n"
        .global _DSP_vrecip16n

_DSP_vrecip16n: .cproc  A_X, B_YM, A_YE, B_n  
   
        .no_mdep
        .rega   A_x0, A_x1, A_rr, A_nx0, A_nx1, A_nx10, A_ny10, A_v10, A_vc10  
        .rega   A_vs1:A_vs0, A_vs10, A_u10, A_vu0, A_vu1, A_vu10, A_u0, A_u1  
        .rega   A_y0, A_y1, A_y32:A_y10, A_vp10, A_x32:A_x10
        .rega   A_ss, A_cc, A_mm, A_ww, A_w, A_rnd
        .regb   B_x2, B_x3, B_rr, B_nx2, B_nx3, B_nx32, B_y32, B_v32, B_vc32
        .regb   B_vs3:B_vs2, B_vs32, B_u32, B_vu2, B_vu3, B_vu32, B_u2, B_u3    
        .regb   B_y2, B_y3, B_ny32:B_ny10, B_vp32
        .regb   B_ss, B_cc, B_mm, B_ww, B_w, B_rnd, B_nn, B_X
        .reg    B_i, C10, C32, Cl10, Cl32

            ADD         4,              A_X,            B_X
           
            MVKL        0xB67BB67B,     B_mm                    ; Q15(1.4256)                  
            MVKH        0xB67BB67B,     B_mm                    ; Q15(1.4256)
            MV          B_mm,           A_mm
            MVKL        0x60006000,     A_cc
            MVKH        0x60006000,     A_cc
            MV          A_cc,           B_cc      
            MVKL        0xFFF1FFF1,     B_rr
            MVKH        0xFFF1FFF1,     B_rr
            MV          B_rr,           A_rr
            MVKL        0x80008000,     A_ww
            MVKH        0x80008000,     A_ww
            MV          A_ww,           B_ww
            SHL         A_ww,           15,             A_w     ; 0x40000000  
            SHL         B_ww,           15,             B_w     ; 0x40000000          
            ROTL        A_w,            17,             A_rnd   ; 0x00008000
            ROTL        B_w,            17,             B_rnd   ; 0x00008000
            MVK         15,             A_ss
            MVK         15,             B_ss

            SHR         B_n,            2,              B_i
            SUB         B_i,            2,              B_i
   
LOOP_vrecip: .trip 16
            LDH         *A_X++,         A_x0
            LDH         *A_X++[3],      A_x1  
            LDH         *B_X++,         B_x2
            LDH         *B_X++[3],      B_x3
 
            NORM        A_x0,           A_nx0
            NORM        A_x1,           A_nx1
            NORM        B_x2,           B_nx2
            NORM        B_x3,           B_nx3
            PACK2       A_nx1,          A_nx0,          A_nx10
            PACK2       B_nx3,          B_nx2,          B_nx32
            ADD2        A_rr,           A_nx10,         A_ny10
            ADD2        B_rr,           B_nx32,         B_ny32

            SSHVL       A_x0,           A_nx0,          A_x0
            SSHVL       A_x1,           A_nx1,          A_x1
            SSHVL       B_x2,           B_nx2,          B_x2
            SSHVL       B_x3,           B_nx3,          B_x3
            PACKH2      A_x1,           A_x0,           A_v10
            PACKH2      B_x3,           B_x2,           B_v32
           
            ; Initial value, U0=(V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits
            ABS2        A_v10,          A_vp10
            ABS2        B_v32,          B_vp32
            SUB2        A_vp10,         A_cc,           A_vc10
            SUB2        B_vp32,         B_cc,           B_vc32
            SUB2        A_mm,           A_vp10,         A_u10
            SUB2        B_mm,           B_vp32,         B_u32
            SMPY2       A_vc10,         A_vc10,         A_vs1:A_vs0
            SMPY2       B_vc32,         B_vc32,         B_vs3:B_vs2
            PACKH2      A_vs1,          A_vs0,          A_vs10
            PACKH2      B_vs3,          B_vs2,          B_vs32
            ADD2        A_vs10,         A_u10,          A_u10
            ADD2        B_vs32,         B_u32,          B_u32
            SHR2        A_v10,          A_ss,           C10
            SHR2        B_v32,          B_ss,           C32
            XOR         C10,            A_u10,          A_u10
            XOR         C32,            B_u32,          B_u32
                       
            ; Iteration 1, U1=U0*(1-V*U0), ~1/(4*V), 11.481 bits
            SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
            SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
            PACKH2      A_vu1,          A_vu0,          A_vu10
            PACKH2      B_vu3,          B_vu2,          B_vu32
            SUB2        A_ww,           A_vu10,         A_vu10
            SUB2        B_ww,           B_vu32,         B_vu32
            SMPY2       A_u10,          A_vu10,         A_u1:A_u0
            SMPY2       B_u32,          B_vu32,         B_u3:B_u2
            PACKH2      A_u1,           A_u0,           A_u10
            PACKH2      B_u3,           B_u2,           B_u32

            ; Iteration 2, U2=8*U1*(1/2-V*U1), ~1/(2*V), 22.585 bits
            SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
            SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
            SUB         A_w,            A_vu0,          A_vu0
            SUB         A_w,            A_vu1,          A_vu1
            SUB         B_w,            B_vu2,          B_vu2
            SUB         B_w,            B_vu3,          B_vu3
            MPYHIR      A_u0,           A_vu0,          A_u0
            MPYHIR      A_u1,           A_vu1,          A_u1
            MPYHIR      B_u2,           B_vu2,          B_u2
            MPYHIR      B_u3,           B_vu3,          B_u3
            SSHL        A_u0,           3,              A_u0
            SSHL        A_u1,           3,              A_u1
            SSHL        B_u2,           3,              B_u2
            SSHL        B_u3,           3,              B_u3          

            ; Fl16 = Q15(mant)|int16(expt)
            SADD        A_u0,           A_rnd,          A_y0
            SADD        A_u1,           A_rnd,          A_y1
            SADD        B_u2,           B_rnd,          B_y2
            SADD        B_u3,           B_rnd,          B_y3
            PACKH2      A_y1,           A_y0,           A_y10
            PACKH2      B_y3,           B_y2,           B_y32
 
            MV          B_y32,          A_y32
            MV          A_ny10,         B_ny10

            STDW        A_y32:A_y10,    *B_YM++
            STDW        B_ny32:B_ny10,  *A_YE++

            BDEC        LOOP_vrecip,    B_i
           
            .endproc
 
 
Rate this code snippet:
0
Rating: 0 | Votes: 0
 
   
 
posted by Shuhua Zhang
Awarded a doctor's degree in electronic engineering from Tsinghua University January 2011, now I work in the Gipsa-lab in France as a post-doctoral researcher doing informed source separation.


Comments


 

rrlagic wrote:

2/6/2012
 
Hi!
This contribution might be a very useful. However, only for people with good knowledge of C64x architecture and assembly. I have compared performance of this contribution to DSP lib's function and found some discrepancies on short vectors. If we look inside assembly, ".trip 16" suggests that loop minimum trip count is 16. Then, right shift of element count indicates that we process 4 elements at a time. So what is minimum input vector length? Specifying restrictions would be helpful.
Anyway, very nice idea and good hint.
 

shuhua.zhang wrote:

2/16/2012
 
Hi,
Many thanks for your comment and pointing out the problem. I've updated this post to emphasize the restriction on N (>=24 and =0(mod 4)) and give an exact cpu cycle count (102+2.5*N) obtained from CCS4.  

Results displayed in CCS4 are all int16. However, x[i] and ym[i] should interpreted as Q.15, while ye[i] is still int16. Therefore, to verify the computation results, a simple way is to compare  (1/ym[i])*2^(30-ye[i]) and x[i], this time, all viewed as int16.

BTW, I experimented other trip counts in CCS4. (This piece of code was written about 4 years ago, when CCS3 was the latest.) For trip count below 8, CCS4 can't optimize the linear assembly code efficiently, For trip count no less than 8, CCS4 effectively turns on software pipeline. But even with trip count 8, the same restriction on N applies.
 

shuhua.zhang wrote:

2/16/2012
 
Hi,
Many thanks for your comment and pointing out the problem. I've updated this post to emphasize the restriction on N (>=24 and =0(mod 4)) and give an exact cpu cycle count (102+2.5*N) obtained from CCS4.

Results displayed in CCS4 are all int16. However, x[i] and ym[i] should interpreted as Q.15, while ye[i] is still int16. Therefore, to verify the computation results, a simple way is to compare (1/ym[i])*2^(30-ye[i]) and x[i], this time, all viewed as int16.

BTW, I experimented other trip counts in CCS4. (This piece of code was written about 4 years ago for an acoustic echo cancellation application with hundreds to thousands of filter taps. At that time, CCS3 was the latest.) For trip count below 8, CCS4 can't optimize the linear assembly code efficiently. For trip count no less than 8, CCS4 effectively turns on software pipelining. But even with trip count 8, the same restriction on N applies.
 

shuhua.zhang wrote:

2/16/2012
 
Hi,
Many thanks for your comment and pointing out the problem. I've updated this post to emphasize the restriction on N (>=24 and =0(mod 4)) and give an exact cpu cycle count (102+2.5*N) obtained from CCS4.

Results displayed in CCS4 are all int16. However, x[i] and ym[i] should be  interpreted as Q.15, while ye[i] is still int16. Therefore, to verify the computation results, a simple way is to compare (1/ym[i])*2^(30-ye[i]) and x[i], this time, all viewed as int16.

BTW, I experimented other trip counts in CCS4. (This piece of code was written about 4 years ago for an acoustic echo cancellation application with hundreds to thousands of filter taps. At that time, CCS3 was the latest.) For trip count below 8, CCS4 can't optimize the linear assembly code efficiently. For trip count no less than 8, CCS4 effectively turns on software pipelining. But even with trip count 8, the same restriction on N applies.
 

nki wrote:

7/8/2013
 
hi Shuhua Zhang
thanks for your posting,can explain your solution with bit number in Q15 i need it for FPGA

Add a Comment
You need to login before you can post a comment (best way to prevent spam). ( Not a member? )