Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication

August 11, 20115 comments Coded in ASM for the TI C64x

In some adaptive filters, say, normalized least mean squares adaptive filter  (NLMS) and multidelay block frequency domain adaptive filter (MDF), update to the filter coefficients is normalized for stability and faster convergence. This kind of normalization needs a computation that all fixed-point DSPs hate very much — division.    

So there is a routine called 'DSP_recip16' to find reciprocals in TI's C64x and C64x+ DSP libraries ('dsp64x.lib' and 'dsp64x+.lib', respectively). Unfortunately, 'DSP_recip16' uses an inefficient algorithm: dividing by a loop of conditional subtracts, costing 8 cycles per input number (in the format of Q.15).

We have a better way — the time-honored Newton-Raphson method. Specificialy, given the fuction f(U) = 1/U - V, then
    Un+1 = Un - f(Un)/f'(Un) = Un(2-VUn) → 1/V
quadratically, i.e., error shrinking squarely, or effective digits doubling each step. 

The following is a hand tuned linear assembly code of finding reciprocals of a vector of Q.15 numbers based on the above method. 

To compute
    y[i] = ym[i] * 2^ye[i] = 1 / x[i], for i = 0,...,N-1, and -1 ≤ x[i] < 1,
where ym is the mantissa (significand) vector and ye is the exponent vector, call
    DSP_vrecip16n(x, ym, ye, N).
This interface is identical to that of TI's 'DSP_recip16'.

Under CCS5.1 (Linux) and with compiler option -O2 (software pipelining turned on), 'DSP_vrecip16n' takes 55+2.5*N to compute N reciprocals. It must be noted that N here is supposed to be a multiple of 4 (loading 4 short numbers a time) and greater than or equal to 24 (software pipelining).  If N is not a multiple of 4, 'DSP_vrecip16n' effectively operate on 4*floor(N/4) inputs. If N is less than 24, the results might be wrong since software pipelining is broken in this case.

The maximum relative error is 2-16. The storage type of the input x[i] and the mantissa part of the output ym[i] is 16-bit short, but arithmetically, they are interpreted as Q.15 fractional numbers, equal to x[i]/215 and ym[i]/215, respectively. However, the exponent part of the output ye[i] is an integer. All these conventions are the same as TI's 'DSP_recip16'.

The specific algorithm used here is 

  •     normalize abs(x[i]) to [.5, 1) and denote it by V;
  •     initialization: U0 = (V - .75)2 + 1.4256 - V, quadratic fitting of 1/(2V) with 5.9233 effective bits;
  •     iteration 1: U1 = U0(1 - VU0), ~ 1/(4V) with 11.481 effective bits;
  •     iteration 2: U2 = 8U1(.5 - VU1), ~ 1/(2V) with 22.585 effective bits;
  •     output 1/x[i] in the format Fl16, or Q.15 mantissa | Int16 exponent.

With the quadratic convergence of the Newton-Raphson method and a fine-tuned initial value, in just 2 iterations, we have 22.585 effective bits — more than could be stored in Fl16. 

Now, C64x and C64x+ could hate division less:) 

* =========================================================================== *
*                                                                             *
*  Compute reciprocals of a Q.15 vector by Newton-Raphson method              *
*      y[i] = ym[i] * 2^ye[i] = 1 / x[i], -1 <= x[i] < 1 and x[i] != 0        *
*      where ym is the mantissa vector and ye is the exponent vector          *
*                                                                             *
*  C prototype:                                                               *
*       void DSP_vrecip16n(short* x,    // Input, Q.15 vector                 *
*                          short* ym,   // Output, Q.15 vector                *
*                          short* ye,   // Output, int16 vector               *
*                          int    N);   // Input, vector length               *
*                                                                             *
*  Restriction:                                                               *
*       (N % 4) == 0 and N >= 24                                              *
*                                                                             *
*  Performance:                                                               *
*       55+2.5*N (software pipelining enabled by -O2, CCS5.1)                 *
*                                                                             *
*  Relative error:                                                            *
*       (-2^-16, 2^-16)                                                       *
*                                                                             *
*  Algorithm:                                                                 *
*       Input: V, abs(x[i]) normalized to [.5, 1)                             *
*       Initialization: U0 = (V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits      *
*       Iteration 1:    U1 = U0*(1-V*U0),          ~1/(4*V), 11.481 bits      *
*       Iteration 2:    U2 = 8*U1*(.5-V*U1),       ~1/(2*V), 22.585 bits      *
*       Output in the format of Fl16 = Q.15(mant)|Int16(expt)                 *
*                                                                             *
* =========================================================================== *
        .sect ".text: _DSP_vrecip16n"
        .global _DSP_vrecip16n

_DSP_vrecip16n: .cproc  A_X, B_YM, A_YE, B_n   
        .rega   A_x0, A_x1, A_rr, A_nx0, A_nx1, A_nx10, A_ny10, A_v10, A_vc10  
        .rega   A_vs1:A_vs0, A_vs10, A_u10, A_vu0, A_vu1, A_vu10, A_u0, A_u1  
        .rega   A_y0, A_y1, A_y32:A_y10, A_vp10, A_x32:A_x10 
        .rega   A_ss, A_cc, A_mm, A_ww, A_w, A_rnd
        .regb   B_x2, B_x3, B_rr, B_nx2, B_nx3, B_nx32, B_y32, B_v32, B_vc32 
        .regb   B_vs3:B_vs2, B_vs32, B_u32, B_vu2, B_vu3, B_vu32, B_u2, B_u3    
        .regb   B_y2, B_y3, B_ny32:B_ny10, B_vp32 
        .regb   B_ss, B_cc, B_mm, B_ww, B_w, B_rnd, B_nn, B_X 
        .reg    B_i, C10, C32, Cl10, Cl32

            ADD         4,              A_X,            B_X
            MVKL        0xB67BB67B,     B_mm                    ; Q15(1.4256)                  
            MVKH        0xB67BB67B,     B_mm                    ; Q15(1.4256)
            MV          B_mm,           A_mm
            MVKL        0x60006000,     A_cc
            MVKH        0x60006000,     A_cc
            MV          A_cc,           B_cc       
            MVKL        0xFFF1FFF1,     B_rr
            MVKH        0xFFF1FFF1,     B_rr
            MV          B_rr,           A_rr
            MVKL        0x80008000,     A_ww
            MVKH        0x80008000,     A_ww
            MV          A_ww,           B_ww
            SHL         A_ww,           15,             A_w     ; 0x40000000  
            SHL         B_ww,           15,             B_w     ; 0x40000000           
            ROTL        A_w,            17,             A_rnd   ; 0x00008000
            ROTL        B_w,            17,             B_rnd   ; 0x00008000
            MVK         15,             A_ss
            MVK         15,             B_ss

            SHR         B_n,            2,              B_i
            SUB         B_i,            2,              B_i
LOOP_vrecip: .trip 16 
            LDH         *A_X++,         A_x0
            LDH         *A_X++[3],      A_x1  
            LDH         *B_X++,         B_x2
            LDH         *B_X++[3],      B_x3
            NORM        A_x0,           A_nx0
            NORM        A_x1,           A_nx1
            NORM        B_x2,           B_nx2
            NORM        B_x3,           B_nx3
            PACK2       A_nx1,          A_nx0,          A_nx10
            PACK2       B_nx3,          B_nx2,          B_nx32
            ADD2        A_rr,           A_nx10,         A_ny10
            ADD2        B_rr,           B_nx32,         B_ny32

            SSHVL       A_x0,           A_nx0,          A_x0
            SSHVL       A_x1,           A_nx1,          A_x1
            SSHVL       B_x2,           B_nx2,          B_x2
            SSHVL       B_x3,           B_nx3,          B_x3
            PACKH2      A_x1,           A_x0,           A_v10
            PACKH2      B_x3,           B_x2,           B_v32
            ; Initial value, U0=(V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits
            ABS2        A_v10,          A_vp10
            ABS2        B_v32,          B_vp32
            SUB2        A_vp10,         A_cc,           A_vc10
            SUB2        B_vp32,         B_cc,           B_vc32
            SUB2        A_mm,           A_vp10,         A_u10
            SUB2        B_mm,           B_vp32,         B_u32
            SMPY2       A_vc10,         A_vc10,         A_vs1:A_vs0
            SMPY2       B_vc32,         B_vc32,         B_vs3:B_vs2
            PACKH2      A_vs1,          A_vs0,          A_vs10
            PACKH2      B_vs3,          B_vs2,          B_vs32
            ADD2        A_vs10,         A_u10,          A_u10
            ADD2        B_vs32,         B_u32,          B_u32
            SHR2        A_v10,          A_ss,           C10
            SHR2        B_v32,          B_ss,           C32
            XOR         C10,            A_u10,          A_u10
            XOR         C32,            B_u32,          B_u32
            ; Iteration 1, U1=U0*(1-V*U0), ~1/(4*V), 11.481 bits
            SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
            SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
            PACKH2      A_vu1,          A_vu0,          A_vu10
            PACKH2      B_vu3,          B_vu2,          B_vu32
            SUB2        A_ww,           A_vu10,         A_vu10
            SUB2        B_ww,           B_vu32,         B_vu32
            SMPY2       A_u10,          A_vu10,         A_u1:A_u0
            SMPY2       B_u32,          B_vu32,         B_u3:B_u2
            PACKH2      A_u1,           A_u0,           A_u10
            PACKH2      B_u3,           B_u2,           B_u32

            ; Iteration 2, U2=8*U1*(1/2-V*U1), ~1/(2*V), 22.585 bits
            SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
            SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
            SUB         A_w,            A_vu0,          A_vu0
            SUB         A_w,            A_vu1,          A_vu1
            SUB         B_w,            B_vu2,          B_vu2
            SUB         B_w,            B_vu3,          B_vu3
            MPYHIR      A_u0,           A_vu0,          A_u0
            MPYHIR      A_u1,           A_vu1,          A_u1
            MPYHIR      B_u2,           B_vu2,          B_u2
            MPYHIR      B_u3,           B_vu3,          B_u3
            SSHL        A_u0,           3,              A_u0
            SSHL        A_u1,           3,              A_u1
            SSHL        B_u2,           3,              B_u2
            SSHL        B_u3,           3,              B_u3          

            ; Fl16 = Q15(mant)|int16(expt)
            SADD        A_u0,           A_rnd,          A_y0
            SADD        A_u1,           A_rnd,          A_y1
            SADD        B_u2,           B_rnd,          B_y2
            SADD        B_u3,           B_rnd,          B_y3
            PACKH2      A_y1,           A_y0,           A_y10
            PACKH2      B_y3,           B_y2,           B_y32
            MV          B_y32,          A_y32
            MV          A_ny10,         B_ny10

            STDW        A_y32:A_y10,    *B_YM++
            STDW        B_ny32:B_ny10,  *A_YE++

            BDEC        LOOP_vrecip,    B_i