## Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication

August 11, 20115 comments Coded in ASM for the TI C64x

In some adaptive filters, say, normalized least mean squares adaptive filter  (NLMS) and multidelay block frequency domain adaptive filter (MDF), update to the filter coefficients is normalized for stability and faster convergence. This kind of normalization needs a computation that all fixed-point DSPs hate very much — division.

So there is a routine called 'DSP_recip16' to find reciprocals in TI's C64x and C64x+ DSP libraries ('dsp64x.lib' and 'dsp64x+.lib', respectively). Unfortunately, 'DSP_recip16' uses an inefficient algorithm: dividing by a loop of conditional subtracts, costing 8 cycles per input number (in the format of Q.15).

We have a better way — the time-honored Newton-Raphson method. Specificialy, given the fuction f(U) = 1/U - V, then
Un+1 = Un - f(Un)/f'(Un) = Un(2-VUn) → 1/V
quadratically, i.e., error shrinking squarely, or effective digits doubling each step.

The following is a hand tuned linear assembly code of finding reciprocals of a vector of Q.15 numbers based on the above method.

To compute
y[i] = ym[i] * 2^ye[i] = 1 / x[i], for i = 0,...,N-1, and -1 ≤ x[i] < 1,
where ym is the mantissa (significand) vector and ye is the exponent vector, call
DSP_vrecip16n(x, ym, ye, N).
This interface is identical to that of TI's 'DSP_recip16'.

Under CCS5.1 (Linux) and with compiler option -O2 (software pipelining turned on), 'DSP_vrecip16n' takes 55+2.5*N to compute N reciprocals. It must be noted that N here is supposed to be a multiple of 4 (loading 4 short numbers a time) and greater than or equal to 24 (software pipelining).  If N is not a multiple of 4, 'DSP_vrecip16n' effectively operate on 4*floor(N/4) inputs. If N is less than 24, the results might be wrong since software pipelining is broken in this case.

The maximum relative error is 2-16. The storage type of the input x[i] and the mantissa part of the output ym[i] is 16-bit short, but arithmetically, they are interpreted as Q.15 fractional numbers, equal to x[i]/215 and ym[i]/215, respectively. However, the exponent part of the output ye[i] is an integer. All these conventions are the same as TI's 'DSP_recip16'.

The specific algorithm used here is

•     normalize abs(x[i]) to [.5, 1) and denote it by V;
•     initialization: U0 = (V - .75)2 + 1.4256 - V, quadratic fitting of 1/(2V) with 5.9233 effective bits;
•     iteration 1: U1 = U0(1 - VU0), ~ 1/(4V) with 11.481 effective bits;
•     iteration 2: U2 = 8U1(.5 - VU1), ~ 1/(2V) with 22.585 effective bits;
•     output 1/x[i] in the format Fl16, or Q.15 mantissa | Int16 exponent.

With the quadratic convergence of the Newton-Raphson method and a fine-tuned initial value, in just 2 iterations, we have 22.585 effective bits — more than could be stored in Fl16.

Now, C64x and C64x+ could hate division less:)

``````* =========================================================================== *
*                                                                             *
*  Compute reciprocals of a Q.15 vector by Newton-Raphson method              *
*      y[i] = ym[i] * 2^ye[i] = 1 / x[i], -1 <= x[i] < 1 and x[i] != 0        *
*      where ym is the mantissa vector and ye is the exponent vector          *
*                                                                             *
*  C prototype:                                                               *
*       void DSP_vrecip16n(short* x,    // Input, Q.15 vector                 *
*                          short* ym,   // Output, Q.15 vector                *
*                          short* ye,   // Output, int16 vector               *
*                          int    N);   // Input, vector length               *
*                                                                             *
*  Restriction:                                                               *
*       (N % 4) == 0 and N >= 24                                              *
*                                                                             *
*  Performance:                                                               *
*       55+2.5*N (software pipelining enabled by -O2, CCS5.1)                 *
*                                                                             *
*  Relative error:                                                            *
*       (-2^-16, 2^-16)                                                       *
*                                                                             *
*  Algorithm:                                                                 *
*       Input: V, abs(x[i]) normalized to [.5, 1)                             *
*       Initialization: U0 = (V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits      *
*       Iteration 1:    U1 = U0*(1-V*U0),          ~1/(4*V), 11.481 bits      *
*       Iteration 2:    U2 = 8*U1*(.5-V*U1),       ~1/(2*V), 22.585 bits      *
*       Output in the format of Fl16 = Q.15(mant)|Int16(expt)                 *
*                                                                             *
* =========================================================================== *

.sect ".text: _DSP_vrecip16n"
.global _DSP_vrecip16n

_DSP_vrecip16n: .cproc  A_X, B_YM, A_YE, B_n

.no_mdep
.rega   A_x0, A_x1, A_rr, A_nx0, A_nx1, A_nx10, A_ny10, A_v10, A_vc10
.rega   A_vs1:A_vs0, A_vs10, A_u10, A_vu0, A_vu1, A_vu10, A_u0, A_u1
.rega   A_y0, A_y1, A_y32:A_y10, A_vp10, A_x32:A_x10
.rega   A_ss, A_cc, A_mm, A_ww, A_w, A_rnd
.regb   B_x2, B_x3, B_rr, B_nx2, B_nx3, B_nx32, B_y32, B_v32, B_vc32
.regb   B_vs3:B_vs2, B_vs32, B_u32, B_vu2, B_vu3, B_vu32, B_u2, B_u3
.regb   B_y2, B_y3, B_ny32:B_ny10, B_vp32
.regb   B_ss, B_cc, B_mm, B_ww, B_w, B_rnd, B_nn, B_X
.reg    B_i, C10, C32, Cl10, Cl32

MVKL        0xB67BB67B,     B_mm                    ; Q15(1.4256)
MVKH        0xB67BB67B,     B_mm                    ; Q15(1.4256)
MV          B_mm,           A_mm
MVKL        0x60006000,     A_cc
MVKH        0x60006000,     A_cc
MV          A_cc,           B_cc
MVKL        0xFFF1FFF1,     B_rr
MVKH        0xFFF1FFF1,     B_rr
MV          B_rr,           A_rr
MVKL        0x80008000,     A_ww
MVKH        0x80008000,     A_ww
MV          A_ww,           B_ww
SHL         A_ww,           15,             A_w     ; 0x40000000
SHL         B_ww,           15,             B_w     ; 0x40000000
ROTL        A_w,            17,             A_rnd   ; 0x00008000
ROTL        B_w,            17,             B_rnd   ; 0x00008000
MVK         15,             A_ss
MVK         15,             B_ss

SHR         B_n,            2,              B_i
SUB         B_i,            2,              B_i

LOOP_vrecip: .trip 16
LDH         *A_X++,         A_x0
LDH         *A_X++[3],      A_x1
LDH         *B_X++,         B_x2
LDH         *B_X++[3],      B_x3

NORM        A_x0,           A_nx0
NORM        A_x1,           A_nx1
NORM        B_x2,           B_nx2
NORM        B_x3,           B_nx3
PACK2       A_nx1,          A_nx0,          A_nx10
PACK2       B_nx3,          B_nx2,          B_nx32

SSHVL       A_x0,           A_nx0,          A_x0
SSHVL       A_x1,           A_nx1,          A_x1
SSHVL       B_x2,           B_nx2,          B_x2
SSHVL       B_x3,           B_nx3,          B_x3
PACKH2      A_x1,           A_x0,           A_v10
PACKH2      B_x3,           B_x2,           B_v32

; Initial value, U0=(V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits
ABS2        A_v10,          A_vp10
ABS2        B_v32,          B_vp32
SUB2        A_vp10,         A_cc,           A_vc10
SUB2        B_vp32,         B_cc,           B_vc32
SUB2        A_mm,           A_vp10,         A_u10
SUB2        B_mm,           B_vp32,         B_u32
SMPY2       A_vc10,         A_vc10,         A_vs1:A_vs0
SMPY2       B_vc32,         B_vc32,         B_vs3:B_vs2
PACKH2      A_vs1,          A_vs0,          A_vs10
PACKH2      B_vs3,          B_vs2,          B_vs32
SHR2        A_v10,          A_ss,           C10
SHR2        B_v32,          B_ss,           C32
XOR         C10,            A_u10,          A_u10
XOR         C32,            B_u32,          B_u32

; Iteration 1, U1=U0*(1-V*U0), ~1/(4*V), 11.481 bits
SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
PACKH2      A_vu1,          A_vu0,          A_vu10
PACKH2      B_vu3,          B_vu2,          B_vu32
SUB2        A_ww,           A_vu10,         A_vu10
SUB2        B_ww,           B_vu32,         B_vu32
SMPY2       A_u10,          A_vu10,         A_u1:A_u0
SMPY2       B_u32,          B_vu32,         B_u3:B_u2
PACKH2      A_u1,           A_u0,           A_u10
PACKH2      B_u3,           B_u2,           B_u32

; Iteration 2, U2=8*U1*(1/2-V*U1), ~1/(2*V), 22.585 bits
SMPY2       A_v10,          A_u10,          A_vu1:A_vu0
SMPY2       B_v32,          B_u32,          B_vu3:B_vu2
SUB         A_w,            A_vu0,          A_vu0
SUB         A_w,            A_vu1,          A_vu1
SUB         B_w,            B_vu2,          B_vu2
SUB         B_w,            B_vu3,          B_vu3
MPYHIR      A_u0,           A_vu0,          A_u0
MPYHIR      A_u1,           A_vu1,          A_u1
MPYHIR      B_u2,           B_vu2,          B_u2
MPYHIR      B_u3,           B_vu3,          B_u3
SSHL        A_u0,           3,              A_u0
SSHL        A_u1,           3,              A_u1
SSHL        B_u2,           3,              B_u2
SSHL        B_u3,           3,              B_u3

; Fl16 = Q15(mant)|int16(expt)