DSPRelated.com
Forums

distributed arithmetic symmetry?

Started by dtsao November 14, 2006
Hi all,

I am trying to build a FIR filter using distributed arithmetic. Many
papers say I can save space because of the symmetry, thus halving the
number of coefficients and inputs to the LUTs. However, I don't see how
this is possible for LUT. This is my point:
1) If I don't use symmetry, the address to the LUT is just binary, because
either that input bit is 0 or 1. So for a 8 tap filter, the size of the LUT
is 2^8.
2) Many claim that because of symmetry, this can be reduced to 2^4, by
summing the appropriate inputs before going into LUT (i.e. (x1+x8)*C1,
(x2+x7)*C2...). BUT, how can these be used as the same address to the LUT.
Because now, each part of the address must have 3 possible values (0, 1,
2). That is, 1 if exclusively one of x1 or x8 is present, or 2 if both are
there. Therefore, It seems the LUT will be of size 3^(N/2) where N is the
number of taps, instead of 2^N. 
Am I missing something? I don't see how using symmetry will halve the size
of the ciruit. Thanks.



dtsao wrote:
> Hi all, > > I am trying to build a FIR filter using distributed arithmetic. Many > papers say I can save space because of the symmetry, thus halving the > number of coefficients and inputs to the LUTs. However, I don't see how > this is possible for LUT. This is my point: > 1) If I don't use symmetry, the address to the LUT is just binary, because > either that input bit is 0 or 1. So for a 8 tap filter, the size of the LUT > is 2^8. > 2) Many claim that because of symmetry, this can be reduced to 2^4, by > summing the appropriate inputs before going into LUT (i.e. (x1+x8)*C1, > (x2+x7)*C2...). BUT, how can these be used as the same address to the LUT. > Because now, each part of the address must have 3 possible values (0, 1, > 2). That is, 1 if exclusively one of x1 or x8 is present, or 2 if both are > there. Therefore, It seems the LUT will be of size 3^(N/2) where N is the > number of taps, instead of 2^N. > Am I missing something? I don't see how using symmetry will halve the size > of the ciruit. Thanks.
The addresses of the LUT apply to the Cs. Since C2 and C7 are the same, you take advantage of the identity x2C2 + x7C7 = x2C2 + x7C2 = (x2+x7)C2 to save a place in the LUT (no C7) and a multiplication. If the original LUT has an even number of entries, the reduced one has half that. If it has an odd number of entries, the reduced one has half plus 1. Jerry -- Engineering is the art of making what you want from things you can get. �����������������������������������������������������������������������
> >The addresses of the LUT apply to the Cs. Since C2 and C7 are the same, >you take advantage of the identity x2C2 + x7C7 = x2C2 + x7C2 = (x2+x7)C2
>to save a place in the LUT (no C7) and a multiplication. If the original
>LUT has an even number of entries, the reduced one has half that. If it >has an odd number of entries, the reduced one has half plus 1. > >Jerry >-- >Engineering is the art of making what you want from things you can get. >����������������������������������������������������������������������� >
Hi, I know what you mean. This was my original understanding, but I don't see how this works. Because the address of the LUT is based on x1 being present in the input, therefore being multiplied by the C1. So if both x1 and x8 are present, this cannot be represented in the address by 1, it must be represented by 2 (x1+x8), right? So won't the lookup table look something like: 0 0 0 | 0 0 0 1 | C1 (exclusively x1 or x8 is present) 0 0 2 | 2C1 (both x1 AND x8 are present) 0 1 0 | C2 (exclusively x2 or x7 is present) 0 1 1 | C1+C2 0 1 2 | C2+ 2C1 0 2 0 | 2C2 .. | 2 2 2 | 2C1+2C2+2C3 I see your point that C7=C2 so we can factor it. I just think the addresses in the LUT cannot be the same. Shouldn't there be different addresses for the case when (x2=1 and x7=1) and (x2=1 and x7=0)? Because if we use the same LUT, x2+x7 could be equal 2 (i.e. 10 in binary) which overflows the address input, resulting in 0. Thanks.
dtsao wrote:
>>The addresses of the LUT apply to the Cs. Since C2 and C7 are the same, >>you take advantage of the identity x2C2 + x7C7 = x2C2 + x7C2 = (x2+x7)C2 > > >>to save a place in the LUT (no C7) and a multiplication. If the original > > >>LUT has an even number of entries, the reduced one has half that. If it >>has an odd number of entries, the reduced one has half plus 1. >> >>Jerry >>-- >>Engineering is the art of making what you want from things you can get. >>����������������������������������������������������������������������� >> > > > Hi, > > I know what you mean. This was my original understanding, but I don't see > how this works. Because the address of the LUT is based on x1 being > present in the input, therefore being multiplied by the C1. So if both x1 > and x8 are present, this cannot be represented in the address by 1, it > must be represented by 2 (x1+x8), right? > So won't the lookup table look something like: > > 0 0 0 | 0 > 0 0 1 | C1 (exclusively x1 or x8 is present) > 0 0 2 | 2C1 (both x1 AND x8 are present) > 0 1 0 | C2 (exclusively x2 or x7 is present) > 0 1 1 | C1+C2 > 0 1 2 | C2+ 2C1 > 0 2 0 | 2C2 > .. | > 2 2 2 | 2C1+2C2+2C3 > > I see your point that C7=C2 so we can factor it. I just think the > addresses in the LUT cannot be the same. Shouldn't there be different > addresses for the case when (x2=1 and x7=1) and (x2=1 and x7=0)? Because > if we use the same LUT, x2+x7 could be equal 2 (i.e. 10 in binary) which > overflows the address input, resulting in 0. > Thanks. >
Symmetry is handled by putting pre-adders in front of the LUTs so that the data presented to the lut address is the sum of the data to the two taps. For serial distributed arithmetic, (which is the compact form usually considered), the data is presented bit serially so that the N bits of the input are presented at a tap (one bit of the LUT address) on N successive clock cycles. Successive taps have the serial data delayed by N clocks so that the same bit of the previous sample is presented to the next tap. If you are to take advantage of symmetry, a serial adder is inserted between the tapped serial delay and the LUT address. As with the original data, the serially summed data due to symmetry is presented one bit at a time (generally LSB first in order to allow a simple bit serial adder implementation) to the LUT. There has to be one additional bit time per word in order to handle the maximum sum of the folded data. The bit serial adders occupy two 3 input, one bit output LUTs, one for the sum function and one for the carry function. The carry is wrapped back around to the inputs of both so that only the sum is output to the DA LUT. Does that clear it up at all?
dtsao wrote:
>> The addresses of the LUT apply to the Cs. Since C2 and C7 are the same, >> you take advantage of the identity x2C2 + x7C7 = x2C2 + x7C2 = (x2+x7)C2 > >> to save a place in the LUT (no C7) and a multiplication. If the original > >> LUT has an even number of entries, the reduced one has half that. If it >> has an odd number of entries, the reduced one has half plus 1. >> >> Jerry >> -- >> Engineering is the art of making what you want from things you can get. >> ����������������������������������������������������������������������� >> > > Hi, > > I know what you mean. This was my original understanding, but I don't see > how this works. Because the address of the LUT is based on x1 being > present in the input, therefore being multiplied by the C1. So if both x1 > and x8 are present, this cannot be represented in the address by 1, it > must be represented by 2 (x1+x8), right?
No. The x has nothing to do with which C is needed.
> So won't the lookup table look something like: > > 0 0 0 | 0 > 0 0 1 | C1 (exclusively x1 or x8 is present) > 0 0 2 | 2C1 (both x1 AND x8 are present) > 0 1 0 | C2 (exclusively x2 or x7 is present) > 0 1 1 | C1+C2 > 0 1 2 | C2+ 2C1 > 0 2 0 | 2C2 > .. | > 2 2 2 | 2C1+2C2+2C3
Not if you are implementing an FIR filter. the xs run by in a shift register, are summed in pairs, multiplied by the Cs which apply to the fixed summers, then the products summed to produce a single result. The data are then advanced in the shift register and the process repeated.
> > I see your point that C7=C2 so we can factor it. I just think the > addresses in the LUT cannot be the same. Shouldn't there be different > addresses for the case when (x2=1 and x7=1) and (x2=1 and x7=0)? Because > if we use the same LUT, x2+x7 could be equal 2 (i.e. 10 in binary) which > overflows the address input, resulting in 0.
Again, x in no way determines an address. Jerry -- Engineering is the art of making what you want from things you can get. �����������������������������������������������������������������������
Ray Andraka wrote:
> dtsao wrote: >>> The addresses of the LUT apply to the Cs. Since C2 and C7 are the >>> same, you take advantage of the identity x2C2 + x7C7 = x2C2 + x7C2 = >>> (x2+x7)C2 >> >> >>> to save a place in the LUT (no C7) and a multiplication. If the original >> >> >>> LUT has an even number of entries, the reduced one has half that. If >>> it has an odd number of entries, the reduced one has half plus 1. >>> >>> Jerry >>> -- >>> Engineering is the art of making what you want from things you can get. >>> ����������������������������������������������������������������������� >>> >> >> >> Hi, >> >> I know what you mean. This was my original understanding, but I don't see >> how this works. Because the address of the LUT is based on x1 being >> present in the input, therefore being multiplied by the C1. So if both x1 >> and x8 are present, this cannot be represented in the address by 1, it >> must be represented by 2 (x1+x8), right? >> So won't the lookup table look something like: >> >> 0 0 0 | 0 >> 0 0 1 | C1 (exclusively x1 or x8 is present) >> 0 0 2 | 2C1 (both x1 AND x8 are present) >> 0 1 0 | C2 (exclusively x2 or x7 is present) >> 0 1 1 | C1+C2 >> 0 1 2 | C2+ 2C1 >> 0 2 0 | 2C2 >> .. | >> 2 2 2 | 2C1+2C2+2C3 >> >> I see your point that C7=C2 so we can factor it. I just think the >> addresses in the LUT cannot be the same. Shouldn't there be different >> addresses for the case when (x2=1 and x7=1) and (x2=1 and x7=0)? Because >> if we use the same LUT, x2+x7 could be equal 2 (i.e. 10 in binary) which >> overflows the address input, resulting in 0. >> Thanks. >> > > > Symmetry is handled by putting pre-adders in front of the LUTs so that > the data presented to the lut address is the sum of the data to the two > taps. For serial distributed arithmetic, (which is the compact form > usually considered), the data is presented bit serially so that the N > bits of the input are presented at a tap (one bit of the LUT address) on > N successive clock cycles. Successive taps have the serial data delayed > by N clocks so that the same bit of the previous sample is presented to > the next tap. If you are to take advantage of symmetry, a serial adder > is inserted between the tapped serial delay and the LUT address. As > with the original data, the serially summed data due to symmetry is > presented one bit at a time (generally LSB first in order to allow a > simple bit serial adder implementation) to the LUT. There has to be one > additional bit time per word in order to handle the maximum sum of the > folded data. The bit serial adders occupy two 3 input, one bit output > LUTs, one for the sum function and one for the carry function. The > carry is wrapped back around to the inputs of both so that only the sum > is output to the DA LUT. > > Does that clear it up at all?
Well, it clears it up for me. I was on the track to the wrong station! Jerry -- Engineering is the art of making what you want from things you can get. �����������������������������������������������������������������������
>> Symmetry is handled by putting pre-adders in front of the LUTs so that >> the data presented to the lut address is the sum of the data to the two
>> taps. For serial distributed arithmetic, (which is the compact form >> usually considered), the data is presented bit serially so that the N >> bits of the input are presented at a tap (one bit of the LUT address)
on
>> N successive clock cycles. Successive taps have the serial data
delayed
>> by N clocks so that the same bit of the previous sample is presented to
>> the next tap. If you are to take advantage of symmetry, a serial adder
>> is inserted between the tapped serial delay and the LUT address. As >> with the original data, the serially summed data due to symmetry is >> presented one bit at a time (generally LSB first in order to allow a >> simple bit serial adder implementation) to the LUT. There has to be
one
>> additional bit time per word in order to handle the maximum sum of the
>> folded data. The bit serial adders occupy two 3 input, one bit output
>> LUTs, one for the sum function and one for the carry function. The >> carry is wrapped back around to the inputs of both so that only the sum
>> is output to the DA LUT. >> >> Does that clear it up at all? >
I think I see what you are saying. So if I use symmetry, it requires double the amount of time to get the output?
dtsao wrote:

> > I think I see what you are saying. So if I use symmetry, it requires > double the amount of time to get the output? > >
No, it takes one extra clock cycle of latency to account for the 1 clock latency in the serial adder. The tapped delay for a non-symmetric FIR filter is present in the symmetric one as well, but it is folded so that the first LUT input comes from the sum of delay taps 0 and N-1, the next from 1 and N-2, the next from 2 and N-3 and so on where N is the length of the filter, and delay taps have a delay of m clocks between them. M is at least the number of bits in the data input, or the number of bits in the data input plus one for a symmetric filter. Look at that first tap. In a non-symmetric filter, tap 0 feeds the first address bit of the first LUT, and there is a delay of typically one clock going through the LUT, plus any delays in the adder network that combines the partial sums from the LUTs. Further assume that the input data is presented bit serially, LSB first. Now, if we add symmetry, in involves inserting a bit serial adder in front of the LUT input corresponding to tap 0. The bit serial adder accepts data LSB first, and produces the sum, also LSB first, one clock cycle delayed from the input. The delay from tap0 is therefore just one clock cycle greater than it was without symmetry, and likewise, the delay from tap N-1 is just one clock cycle greater than it was with the non-symmetric design.
>dtsao wrote: > >> >> I think I see what you are saying. So if I use symmetry, it requires >> double the amount of time to get the output? >> >> > >No, it takes one extra clock cycle of latency to account for the 1 clock
>latency in the serial adder. The tapped delay for a non-symmetric FIR >filter is present in the symmetric one as well, but it is folded so that
>the first LUT input comes from the sum of delay taps 0 and N-1, the next
>from 1 and N-2, the next from 2 and N-3 and so on where N is the length >of the filter, and delay taps have a delay of m clocks between them. M >is at least the number of bits in the data input, or the number of bits >in the data input plus one for a symmetric filter. > >Look at that first tap. In a non-symmetric filter, tap 0 feeds the >first address bit of the first LUT, and there is a delay of typically >one clock going through the LUT, plus any delays in the adder network >that combines the partial sums from the LUTs. Further assume that the >input data is presented bit serially, LSB first. Now, if we add >symmetry, in involves inserting a bit serial adder in front of the LUT >input corresponding to tap 0. The bit serial adder accepts data LSB >first, and produces the sum, also LSB first, one clock cycle delayed >from the input. The delay from tap0 is therefore just one clock cycle >greater than it was without symmetry, and likewise, the delay from tap >N-1 is just one clock cycle greater than it was with the non-symmetric >design. >
Sorry, still not sure I got this. So there is one extra clock cycle for each bit that is going in serially to the LUT? So if the input is 4bits going in serially to the LUT, there will be 4 extra clock cycles total(one for each bit)?
dtsao wrote:


> > Sorry, still not sure I got this. So there is one extra clock cycle for > each bit that is going in serially to the LUT? So if the input is 4bits > going in serially to the LUT, there will be 4 extra clock cycles total(one > for each bit)? > >
Distributed arithmetic is usually associated with bit serial operation so that you get one LUT address bit per tap. Being bit serial, it requires a minimum of N clocks per sample for an N bit sample width. The latency through the filter via the first coefficient depends on the amount of pipelining in the design. Adding a bit serial adder to the mix increases that latency by one clock cycle: the bit serial adder produces the sum LSB first with a one clock cycle delay from the input. Also, to keep the sum from overflowing, the number of bits per sample has to be one greater than the number of bits in the input, so the minimum clocks per sample for the symmetric filter is N+1 for an N bit input, which for a given input sample rate means the minimum serial clock rate must be a little bit faster. The bit serial adder is: SUM <= A xor B xor CY CY <= MAJ3(A,B,CY) where A and B are the inputs presented one bit at a time, LSB first, SUM is the serial SUM output, produced LSB first, and CY is the carry from bit to bit.