```Hi Eric, my implementation is going to be in hardware. Eventually, anyway; for now I am writing a software emulator of the algorithm.

-Dimitri
```
```On Wed, 25 Dec 2013 07:51:38 -0800 (PST), Dimitris Mavroidis
<dmavroid@gmail.com> wrote:

>
>Thank you Dr Sarwate. I found the list of papers on your student's web page but could not find any links to download them as PDFs. However, after searching online, I was able to find most of the relevant papers.
>
>I will study those papers and let you know if I have more questions.
>
>Thank you so much for your kind help.
>
>Best regards,
>Dimitri Mavroeidis

This sounds like an interesting project.   Is your implementation in
hardware or software?

Eric Jacobsen
Anchor Hill Communications
http://www.anchorhill.com
```
Thank you Dr Sarwate. I found the list of papers on your student's web page but could not find any links to download them as PDFs. However, after searching online, I was able to find most of the relevant papers.

I will study those papers and let you know if I have more questions.

Thank you so much for your kind help.

Best regards,
Dimitri Mavroeidis
```
```On Tuesday, December 24, 2013 2:25:44 PM UTC-6, Dimitris Mavroidis wrote:
> Dr Sarwate,
>
>
>
> Thank you very much for your reply. This makes a lot of sense, and it is indeed very elegant. I fully understand what you say.
>
>
>
> I think that the only remaining item which will take a few extra cycles is the division itself in (12). We need to calculate the reciprocal of the denominator and then multiply by the numerator. As far as I know, the only way to calculate the reciprocal is a lookup table in combinational logic, which can be pipelined over multiple cycles to shorten the critical path.
>
>
>
> I would certainly value your input if you had some proposal about how to perform the division in a more efficient way than that.
>
>
>
> Thank you very much for all your help so far.
>
>
>
> Best regards,
>
> Dimitri Mavroeidis

I have published several papers on inversion and division
in finite fields with my student (now ex-student) Zhiyuan
Yan of the ECE Dept. of Lehigh University.  Look at his
web page, which has most of these available in PDF form.

--Dilip Sarwate
```
```Dr Sarwate,

Thank you very much for your reply. This makes a lot of sense, and it is indeed very elegant. I fully understand what you say.

I think that the only remaining item which will take a few extra cycles is the division itself in (12). We need to calculate the reciprocal of the denominator and then multiply by the numerator. As far as I know, the only way to calculate the reciprocal is a lookup table in combinational logic, which can be pipelined over multiple cycles to shorten the critical path.

I would certainly value your input if you had some proposal about how to perform the division in a more efficient way than that.

Thank you very much for all your help so far.

Best regards,
Dimitri Mavroeidis
```
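A common software model of that lookup-table reciprocal is a pair of log/antilog tables, which is exactly what the combinational ROM would hold. The sketch below is illustrative only: it assumes GF(2^8) with primitive polynomial 0x11d and generator alpha = 2, none of which comes from the thread. Division then becomes two table reads and a subtraction of logs.

```python
def build_tables(poly=0x11d):
    """Exp/log tables for GF(2^8) with generator alpha = 2 (assumed).

    These model the lookup ROM discussed above: exp[i] = alpha^i and
    log[alpha^i] = i, so a reciprocal is exp[(255 - log[x]) % 255].
    """
    exp = [0] * 255
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1                 # multiply by alpha = 2
        if x & 0x100:
            x ^= poly           # reduce modulo the field polynomial
    return exp, log

def gf_inv(x, exp, log):
    """Reciprocal via table lookup (one ROM access in hardware)."""
    if x == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return exp[(255 - log[x]) % 255]

def gf_div(num, den, exp, log):
    """num / den as exp[log[num] - log[den]]: two lookups and a mod-255 add."""
    if den == 0:
        raise ZeroDivisionError
    if num == 0:
        return 0
    return exp[(log[num] + 255 - log[den]) % 255]
```

In hardware the same tables become a 255-entry ROM (or a pipelined composite-field inverter); the software version just makes the behavior easy to check against a reference multiply.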
```On Tuesday, December 24, 2013 2:53:56 AM UTC-6, Dimitris Mavroidis wrote:
> Dr Sarwate,
>
>
>
> Thank you very much for your reply. I believe however that in this post you are referring to the Chien search. I am aware of the method that you describe, and this is what I am planning to use in my design.
>
>
>
> However my question had to do with the Forney calculation, as described in equation (12) of your paper, rather than the Chien search. Because of the "high-order" EEP that is calculated through the riBM and RiBM algorithms, the Forney calculation in your paper requires computing a power, Xi^(m0 + 2t). That power may become the critical path. I was wondering if you had any suggestions on how to perform the Forney calculation at each root position efficiently.
>

It is the very same thing as I described above.

You are evaluating a polynomial z^{m_0+2t}\Omega^{(h)}(z)
in which the low-order coefficients are 0 (the lowest-degree
term has degree m_0+2t), so those terms don't need to be
saved in registers, multiplied by constants each clock
cycle, etc. So you are doing the same thing as you are
doing in the Chien search. Notice also that the denominator
in (12) is actually the sum of alternate terms of what you
will be computing in the Chien search, that is, both the
error evaluation and the Chien search should be done
synchronously. The Chien search produces two answers, one
telling you if this is an error location, and the other
(the sum of alternate terms) telling you what the denominator
is in (12). Thus, we always compute a putative error value,
discarding it if this is not an error location, and using it
if this is indeed an error location.

--Dilip Sarwate
```
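The synchronized Chien/Forney pass described above can be modeled in a few lines of software. This is only a sketch under assumed parameters: GF(2^8) with polynomial 0x11d, and a locator polynomial `lam` with lam[j] the coefficient of degree j (all names are illustrative). Each "clock cycle" yields the full register sum for the root test and the sum of alternate terms for the denominator in (12), both from the same register bank:

```python
def gf_mul(x, y, poly=0x11d):
    """Carry-less multiply reduced modulo the field polynomial."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= poly
    return r

def chien_with_denominator(lam, alpha, n):
    """Yield (Lambda(alpha^i), odd-index partial sum) for i = 0..n-1.

    Register j holds alpha^(i*j) * lam[j] after i cycles. The XOR of
    all registers is the Chien root test (zero flags an error
    location); the XOR of the odd-index registers alone is the sum of
    alternate terms used as the denominator of the error value.
    """
    consts = [1]
    for _ in range(1, len(lam)):
        consts.append(gf_mul(consts[-1], alpha))   # 1, a, a^2, ...
    regs = list(lam)
    for _ in range(n):
        total = 0
        odd = 0
        for j, r in enumerate(regs):
            total ^= r
            if j & 1:
                odd ^= r
        yield total, odd
        regs = [gf_mul(r, c) for r, c in zip(regs, consts)]
```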
```Dr Sarwate,

Thank you very much for your reply. I believe however that in this post you are referring to the Chien search. I am aware of the method that you describe, and this is what I am planning to use in my design.

However my question had to do with the Forney calculation, as described in equation (12) of your paper, rather than the Chien search. Because of the "high-order" EEP that is calculated through the riBM and RiBM algorithms, the Forney calculation in your paper requires computing a power, Xi^(m0 + 2t). That power may become the critical path. I was wondering if you had any suggestions on how to perform the Forney calculation at each root position efficiently.

Thank you,
Dimitri Mavroeidis
```
```On Monday, December 23, 2013 11:24:28 AM UTC-6, Dimitris Mavroidis wrote:

> I am afraid that calculating Xi**(m0 + 2t) in a single clock cycle through combinational logic might be on the critical path of my design, which would force me to decrease the clock frequency accordingly. This, as you understand, can take a big bite out of the savings of calculating the EEP at the same time as the ELP.
>

Asked to evaluate a polynomial at one point, most people would
promptly remember Horner's rule as the way to go.  However,
in the calculations of error locations and error values,
one should NOT use Horner's rule because we need to evaluate
a polynomial not at one point but at MANY points, and in a
fixed and known order. Specifically, we need to evaluate
f(x) at 1, a, a^2, a^3, ....  IN THAT ORDER.  So,

put the coefficients f0, f1, ... in separate registers.

Their sum is f(1).

In one clock cycle, multiply the contents of the registers
by 1, a, a^2, a^3, ....  respectively.

These are scalar multiplications: the quantities a, a^2, a^3,
etc are all known to the designer, and can be pre-computed
and their values incorporated into the hardware. Multiplication
of the contents of a register by a known constant is a one-cycle
operation; all you have to do is XOR together various combinations
of the register contents and put the result back into the
register.

Thus, after one clock cycle, the registers contain f0, af1,
a^2f2, a^3f3, .... and their sum is f(a).

Lather, rinse, repeat...  The register contents in successive
clock cycles are

f0     f1      f2      f3     ......   sum = f(1)

f0    af1   a^2f2   a^3f3     ......   sum = f(a)

f0   a^2f1  a^4f2   a^6f3     ......   sum = f(a^2)

f0   a^3f1  a^6f2   a^9f3     .......  sum = f(a^3)

and so on and so forth.

--Dilip Sarwate

```
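The register schedule tabulated above translates directly into software. A minimal sketch, assuming GF(2^8) with primitive polynomial 0x11d and a = 2 (the field choice and the function names are illustrative assumptions):

```python
def gf_mul(x, y, poly=0x11d):
    """Carry-less multiply reduced modulo the field polynomial."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= poly
    return r

def eval_at_powers(coeffs, alpha, n):
    """Yield f(1), f(alpha), f(alpha^2), ... IN THAT ORDER.

    Register j starts at f_j and is multiplied each cycle by the
    fixed constant alpha^j (in hardware, a constant multiplier is
    just a small XOR network), so after i cycles it holds
    alpha^(i*j) * f_j and the XOR of all registers is f(alpha^i).
    """
    consts = [1]
    for _ in range(1, len(coeffs)):
        consts.append(gf_mul(consts[-1], alpha))   # 1, a, a^2, ...
    regs = list(coeffs)
    for _ in range(n):
        acc = 0
        for r in regs:
            acc ^= r            # addition in GF(2^m) is XOR
        yield acc
        regs = [gf_mul(r, c) for r, c in zip(regs, consts)]
```

Each output costs one XOR tree plus one bank of constant multiplies, matching the successive rows of register contents shown above.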
```Dr. Sarwate,

Thank you very much for your suggestion. I downloaded and read your paper, and I am indeed inclined to use that approach since as you say it can calculate both polynomials in 2t clock cycles. This is excellent.

I do have one question for you though. The error evaluator polynomial (EEP) that you calculate is not exactly the same as the one calculated through the conventional algorithms. In order to use that EEP, one needs to calculate Xi to the power of (m0 + 2t), as shown in equation (12). Of course, if one is free to choose m0 and chooses m0 = -2t, this power is eliminated, but in my case I can't change m0: it has to be 0, or my circuit will use a different generator polynomial and be incompatible with other circuits that use m0 = 0.

I am afraid that calculating Xi**(m0 + 2t) in a single clock cycle through combinational logic might be on the critical path of my design, which would force me to decrease the clock frequency accordingly. This, as you understand, can take a big bite out of the savings of calculating the EEP at the same time as the ELP.

Am I missing something here? Any suggestions would be very welcome.

Regards,
Dimitri Mavroidis

```
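One way to keep a power like Xi**(m0 + 2t) out of the combinational path, consistent with the reply above though not quoted from it: since the Chien search visits the positions in the fixed order alpha^0, alpha^1, alpha^2, ..., the needed power can be generated recursively, one constant multiply per cycle by the precomputed alpha^(m0 + 2t). A sketch, assuming GF(2^8) with polynomial 0x11d and the exponent m0 + 2t passed in as `k` (all names are assumptions):

```python
def gf_mul(x, y, poly=0x11d):
    """Carry-less multiply reduced modulo the field polynomial."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= poly
    return r

def gf_pow(x, e):
    """Square-and-multiply exponentiation (precomputes the constant)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def running_power(alpha, k, n):
    """Yield (alpha^i)^k for i = 0..n-1, one multiply per cycle.

    Since (alpha^i)^k = (alpha^k)^i, a single register multiplied
    each cycle by the precomputed constant alpha^k replaces the
    combinational power unit that would otherwise sit on the
    critical path.
    """
    step = gf_pow(alpha, k)
    p = 1
    for _ in range(n):
        yield p
        p = gf_mul(p, step)
```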
```On Friday, December 20, 2013 11:04:54 AM UTC-6, Dimitris Mavroidis wrote:
> Hello,
>
>
>
> I have been reading the excellent paper "A High-Speed and Low-Latency Reed-Solomon Decoder Based on a Dual-Line Structure" by Hyeong-Ju Kang and In-Cheol Park.
>

I have a different opinion: the Kang-Park approach misses
the boat entirely.

For decoders (dual-line, if you want to call them that)
that find both the error-evaluator and the error-locator
polynomial in just 2t clock cycles (instead of the Kang-Park
approach of first finding the error-locator in 2t cycles
and then the error-evaluator in an additional t+1 cycles,
for a total latency of 3t+1 cycles), read my paper

D. V. Sarwate and N. R. Shanbhag, "High-Speed Architectures
for Reed-Solomon Decoders," IEEE Transactions on VLSI Systems,
vol. 9, no. 5, Oct. 2001.

You can find it on IEEExplore or look at this link:

http://www.ifp.illinois.edu/~sarwate/

--Dilip Sarwate

```