Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication
In some adaptive filters, say, normalized least mean squares adaptive filter (NLMS) and multidelay block frequency domain adaptive filter (MDF), update to the filter coefficients is normalized for stability and faster convergence. This kind of normalization needs a computation that all fixed-point DSPs hate very much — division.
So there is a routine called 'DSP_recip16' to find reciprocals in TI's C64x and C64x+ DSP libraries ('dsp64x.lib' and 'dsp64x+.lib', respectively). Unfortunately, 'DSP_recip16' uses an inefficient algorithm: dividing by a loop of conditional subtracts, costing 8 cycles per input number (in the format of Q.15).
We have a better way — the time-honored Newton-Raphson method. Specificialy, given the fuction f(U) = 1/U - V, then
Un+1 = Un - f(Un)/f'(Un) = Un(2-VUn) → 1/V
quadratically, i.e., error shrinking squarely, or effective digits doubling each step.
The following is a hand tuned linear assembly code of finding reciprocals of a vector of Q.15 numbers based on the above method.
To compute
y[i] = ym[i] * 2^ye[i] = 1 / x[i], for i = 0,...,N-1, and -1 ≤ x[i] < 1,
where ym is the mantissa (significand) vector and ye is the exponent vector, call
DSP_vrecip16n(x, ym, ye, N).
This interface is identical to that of TI's 'DSP_recip16'.
Under CCS5.1 (Linux) and with compiler option -O2 (software pipelining turned on), 'DSP_vrecip16n' takes 55+2.5*N to compute N reciprocals. It must be noted that N here is supposed to be a multiple of 4 (loading 4 short numbers a time) and greater than or equal to 24 (software pipelining). If N is not a multiple of 4, 'DSP_vrecip16n' effectively operate on 4*floor(N/4) inputs. If N is less than 24, the results might be wrong since software pipelining is broken in this case.
The maximum relative error is 2-16. The storage type of the input x[i] and the mantissa part of the output ym[i] is 16-bit short, but arithmetically, they are interpreted as Q.15 fractional numbers, equal to x[i]/215 and ym[i]/215, respectively. However, the exponent part of the output ye[i] is an integer. All these conventions are the same as TI's 'DSP_recip16'.
The specific algorithm used here is
- normalize abs(x[i]) to [.5, 1) and denote it by V;
- initialization: U0 = (V - .75)2 + 1.4256 - V, quadratic fitting of 1/(2V) with 5.9233 effective bits;
- iteration 1: U1 = U0(1 - VU0), ~ 1/(4V) with 11.481 effective bits;
- iteration 2: U2 = 8U1(.5 - VU1), ~ 1/(2V) with 22.585 effective bits;
- output 1/x[i] in the format Fl16, or Q.15 mantissa | Int16 exponent.
With the quadratic convergence of the Newton-Raphson method and a fine-tuned initial value, in just 2 iterations, we have 22.585 effective bits — more than could be stored in Fl16.
Now, C64x and C64x+ could hate division less:)
* =========================================================================== *
* *
* Compute reciprocals of a Q.15 vector by Newton-Raphson method *
* y[i] = ym[i] * 2^ye[i] = 1 / x[i], -1 <= x[i] < 1 and x[i] != 0 *
* where ym is the mantissa vector and ye is the exponent vector *
* *
* C prototype: *
* void DSP_vrecip16n(short* x, // Input, Q.15 vector *
* short* ym, // Output, Q.15 vector *
* short* ye, // Output, int16 vector *
* int N); // Input, vector length *
* *
* Restriction: *
* (N % 4) == 0 and N >= 24 *
* *
* Performance: *
* 55+2.5*N (software pipelining enabled by -O2, CCS5.1) *
* *
* Relative error: *
* (-2^-16, 2^-16) *
* *
* Algorithm: *
* Input: V, abs(x[i]) normalized to [.5, 1) *
* Initialization: U0 = (V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits *
* Iteration 1: U1 = U0*(1-V*U0), ~1/(4*V), 11.481 bits *
* Iteration 2: U2 = 8*U1*(.5-V*U1), ~1/(2*V), 22.585 bits *
* Output in the format of Fl16 = Q.15(mant)|Int16(expt) *
* *
* =========================================================================== *
.sect ".text: _DSP_vrecip16n"
.global _DSP_vrecip16n
_DSP_vrecip16n: .cproc A_X, B_YM, A_YE, B_n
.no_mdep
.rega A_x0, A_x1, A_rr, A_nx0, A_nx1, A_nx10, A_ny10, A_v10, A_vc10
.rega A_vs1:A_vs0, A_vs10, A_u10, A_vu0, A_vu1, A_vu10, A_u0, A_u1
.rega A_y0, A_y1, A_y32:A_y10, A_vp10, A_x32:A_x10
.rega A_ss, A_cc, A_mm, A_ww, A_w, A_rnd
.regb B_x2, B_x3, B_rr, B_nx2, B_nx3, B_nx32, B_y32, B_v32, B_vc32
.regb B_vs3:B_vs2, B_vs32, B_u32, B_vu2, B_vu3, B_vu32, B_u2, B_u3
.regb B_y2, B_y3, B_ny32:B_ny10, B_vp32
.regb B_ss, B_cc, B_mm, B_ww, B_w, B_rnd, B_nn, B_X
.reg B_i, C10, C32, Cl10, Cl32
ADD 4, A_X, B_X
MVKL 0xB67BB67B, B_mm ; Q15(1.4256)
MVKH 0xB67BB67B, B_mm ; Q15(1.4256)
MV B_mm, A_mm
MVKL 0x60006000, A_cc
MVKH 0x60006000, A_cc
MV A_cc, B_cc
MVKL 0xFFF1FFF1, B_rr
MVKH 0xFFF1FFF1, B_rr
MV B_rr, A_rr
MVKL 0x80008000, A_ww
MVKH 0x80008000, A_ww
MV A_ww, B_ww
SHL A_ww, 15, A_w ; 0x40000000
SHL B_ww, 15, B_w ; 0x40000000
ROTL A_w, 17, A_rnd ; 0x00008000
ROTL B_w, 17, B_rnd ; 0x00008000
MVK 15, A_ss
MVK 15, B_ss
SHR B_n, 2, B_i
SUB B_i, 2, B_i
LOOP_vrecip: .trip 16
LDH *A_X++, A_x0
LDH *A_X++[3], A_x1
LDH *B_X++, B_x2
LDH *B_X++[3], B_x3
NORM A_x0, A_nx0
NORM A_x1, A_nx1
NORM B_x2, B_nx2
NORM B_x3, B_nx3
PACK2 A_nx1, A_nx0, A_nx10
PACK2 B_nx3, B_nx2, B_nx32
ADD2 A_rr, A_nx10, A_ny10
ADD2 B_rr, B_nx32, B_ny32
SSHVL A_x0, A_nx0, A_x0
SSHVL A_x1, A_nx1, A_x1
SSHVL B_x2, B_nx2, B_x2
SSHVL B_x3, B_nx3, B_x3
PACKH2 A_x1, A_x0, A_v10
PACKH2 B_x3, B_x2, B_v32
; Initial value, U0=(V-.75)^2 + 1.4256-V, ~1/(2*V), 5.9233 bits
ABS2 A_v10, A_vp10
ABS2 B_v32, B_vp32
SUB2 A_vp10, A_cc, A_vc10
SUB2 B_vp32, B_cc, B_vc32
SUB2 A_mm, A_vp10, A_u10
SUB2 B_mm, B_vp32, B_u32
SMPY2 A_vc10, A_vc10, A_vs1:A_vs0
SMPY2 B_vc32, B_vc32, B_vs3:B_vs2
PACKH2 A_vs1, A_vs0, A_vs10
PACKH2 B_vs3, B_vs2, B_vs32
ADD2 A_vs10, A_u10, A_u10
ADD2 B_vs32, B_u32, B_u32
SHR2 A_v10, A_ss, C10
SHR2 B_v32, B_ss, C32
XOR C10, A_u10, A_u10
XOR C32, B_u32, B_u32
; Iteration 1, U1=U0*(1-V*U0), ~1/(4*V), 11.481 bits
SMPY2 A_v10, A_u10, A_vu1:A_vu0
SMPY2 B_v32, B_u32, B_vu3:B_vu2
PACKH2 A_vu1, A_vu0, A_vu10
PACKH2 B_vu3, B_vu2, B_vu32
SUB2 A_ww, A_vu10, A_vu10
SUB2 B_ww, B_vu32, B_vu32
SMPY2 A_u10, A_vu10, A_u1:A_u0
SMPY2 B_u32, B_vu32, B_u3:B_u2
PACKH2 A_u1, A_u0, A_u10
PACKH2 B_u3, B_u2, B_u32
; Iteration 2, U2=8*U1*(1/2-V*U1), ~1/(2*V), 22.585 bits
SMPY2 A_v10, A_u10, A_vu1:A_vu0
SMPY2 B_v32, B_u32, B_vu3:B_vu2
SUB A_w, A_vu0, A_vu0
SUB A_w, A_vu1, A_vu1
SUB B_w, B_vu2, B_vu2
SUB B_w, B_vu3, B_vu3
MPYHIR A_u0, A_vu0, A_u0
MPYHIR A_u1, A_vu1, A_u1
MPYHIR B_u2, B_vu2, B_u2
MPYHIR B_u3, B_vu3, B_u3
SSHL A_u0, 3, A_u0
SSHL A_u1, 3, A_u1
SSHL B_u2, 3, B_u2
SSHL B_u3, 3, B_u3
; Fl16 = Q15(mant)|int16(expt)
SADD A_u0, A_rnd, A_y0
SADD A_u1, A_rnd, A_y1
SADD B_u2, B_rnd, B_y2
SADD B_u3, B_rnd, B_y3
PACKH2 A_y1, A_y0, A_y10
PACKH2 B_y3, B_y2, B_y32
MV B_y32, A_y32
MV A_ny10, B_ny10
STDW A_y32:A_y10, *B_YM++
STDW B_ny32:B_ny10, *A_YE++
BDEC LOOP_vrecip, B_i
.endproc