
Entropy before and after Huffman coding vs. channel capacity

Started by qfran July 3, 2015
Dear all,
please help me to interpret some strange(?) behaviour of Huffman coding
and entropy. Let me explain it with an example:
Consider a source emitting four symbols x1-x4 with probabilities
P(x1)=1/3, P(x2)=1/3, P(x3)=1/6, P(x4)=1/6.
Calculating the entropy with H = -sum(P(xi)*log2(P(xi))) gives us
H = 1.918 bit/symbol. So far so good.
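
A quick numerical check of that figure, as a minimal Python sketch:

  import math
  p = [1/3, 1/3, 1/6, 1/6]                  # symbol probabilities P(x1)..P(x4)
  H = -sum(pi * math.log2(pi) for pi in p)
  print(H)                                  # ~1.918 bit/symbol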

We perform Huffman coding, arranging the symbols in order of decreasing
probability, x1 to x4. We merge x3 and x4, so that all remaining nodes
have probability 1/3. From here we have two possibilities:
1. We next merge x2 with (x3+x4) and get the Huffman code x1=0, x2=10,
x3=110, x4=111.
2. We next merge x1 with x2 and get the Huffman code x1=00, x2=01,
x3=10, x4=11.
Both codes have the same average code length:
L1 = 1*1/3 + 2*1/3 + 3*1/6 + 3*1/6 = 2
L2 = 2*1/3 + 2*1/3 + 2*1/6 + 2*1/6 = 2
So it's the same, which was to be expected and which tells us that we
could choose either one; it doesn't matter.
The fact that the Huffman codes don't gain anything with respect to a
natural (fixed-length) encoding is of no importance here.
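
The same average-length calculation, again only as a small Python sketch:

  p = [1/3, 1/3, 1/6, 1/6]
  code1 = ['0', '10', '110', '111']
  code2 = ['00', '01', '10', '11']
  L1 = sum(pi * len(c) for pi, c in zip(p, code1))
  L2 = sum(pi * len(c) for pi, c in zip(p, code2))
  print(L1, L2)                             # both 2.0 bit/symbol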

Now let's say that we want to transmit the data over a binary channel.
From the point of view of the channel, we see a transparent bit stream of
0s and 1s. The channel doesn't know that some Huffman coding took place;
it just looks at the data to calculate the a-priori probabilities of
0 and 1.

Since the codewords have different lengths, the probability of a 0 in the
bit stream is the expected number of zeros per symbol divided by the
average codeword length.
In the case of the first Huffman code, the 0 comes with probability:
P(0) = (1*1/3 + 1*1/3 + 1*1/6 + 0*1/6) / 2 = (5/6)/2 = 5/12 = 0.4167
In the case of the second Huffman code, the 0 comes with probability:
P(0) = (2*1/3 + 1*1/3 + 1*1/6 + 0*1/6) / 2 = (7/6)/2 = 7/12 = 0.5833
They are different! Ok, one sees from the codewords that they are
different. But this means that the two Huffman codes are not
equivalent!?
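
These bit probabilities can also be checked empirically; the following
rough Python sketch just encodes a long random symbol stream and counts
the zeros:

  import random
  random.seed(0)
  p = [1/3, 1/3, 1/6, 1/6]
  for code in (['0', '10', '110', '111'], ['00', '01', '10', '11']):
      bits = ''.join(random.choices(code, weights=p, k=200_000))
      print(bits.count('0') / len(bits))    # ~0.417 (code 1), ~0.583 (code 2)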

The corresponding entropies (if we consider the two bits 0 and 1 with the
probabilities calculated above) are:
after coding with Huffman code 1: H = 0.9799 bit/bit
after coding with Huffman code 2: H = 0.9799 bit/bit
(the same value, since the binary entropy function is symmetric and
H(5/12) = H(7/12)).

If we take the entropy of the source and divide it by two (the average
codeword length of the Huffman code), we get H = 0.9591 bit/bit. Smaller
than the value after Huffman coding. Of course, since I can't lose
information.
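
The same numbers in a few lines of Python (note that this calculation
treats the code bits as if they were independent, which is exactly the
assumption made above):

  import math
  def h2(q):                                # binary entropy function
      return -q*math.log2(q) - (1 - q)*math.log2(1 - q)
  H_source = 1.918                          # bit/symbol, from above
  print(h2(5/12), h2(7/12))                 # ~0.9799 bit/bit for both codes
  print(H_source / 2)                       # ~0.9591 bit of information per code bit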

But after Huffman coding it seems that I would have gained information.
Yes, the Huffman code was not able in this case to remove any redundancy.
But why has part of the redundancy apparently turned into information?

Shannon tells me that if my channel capacity is larger than the entropy
of the source, I will be able to find a code to transmit the data over
the channel without losses (in the ideal case, of course). Which entropy
should now be compared with my channel capacity? The original one of the
source (because there is no more information available than that)? Or the
apparent entropy after the Huffman encoding, because it is the data after
the Huffman code that enters the channel?

And in the latter case, it would mean that the average codeword length of
the Huffman code is of no interest at all (although this is the usual
measure to compare source codes); what would matter instead is the
probability distribution of the 0s and 1s in the resulting bitstream!

I'm thankful for any comments.

Best regards,
Franz






---------------------------------------
Posted through http://www.DSPRelated.com
On Fri, 03 Jul 2015 12:41:48 -0500, qfran wrote:

> Shannon tells me that if my channel capacity is larger than the entropy
> of the source, I will be able to find a code to transmit the data over
> the channel without losses (in the ideal case, of course). [...]

Hey Franz.  I'm not a super-expert at this stuff, but I think that the
break in your logic is where you take Shannon's "you can find a code" and
you turn it into "so I'm going to use a Huffman code with a very short
dictionary".

Shannon's "you'll be able to find a code" is very much in the
mathematical sense: the channel capacity theorem says that there is a
_possibility_ that there's a code out there that'll do the job.  It
doesn't say that there's a _certainty_ that there's a code out there
that'll do the job, it doesn't say that there's even a likelihood that
humans will be able to find it before we extinct ourselves -- it just
puts a bound on any candidate code that you might use for the purpose.

Also, you may want to re-check your sources.  I'm not going to check
right now, but I'm pretty sure that the Shannon channel capacity theorem
gives you a probability of successful transmission, not a certainty.  I
believe that, no matter how elaborate the code or how far under the
actual channel capacity you're operating, there's always a small but
finite probability of error.  To my knowledge, you can finagle that
probability to be smaller, but you can never make it go away entirely.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
> Shannon's "you'll be able to find a code" is very much in the
> mathematical sense: the channel capacity theorem says that there is a
> _possibility_ that there's a code out there that'll do the job. [...]

Dear Tim,

thank you very much for your fast reply. I think I was not quite clear in
my first post, so please let me be more precise. I don't mean the Huffman
code, which indeed has a very short dictionary; I refer to the channel
code. And in fact I don't want to find that channel code, so maybe let's
put Shannon's theorem the other way round: if the channel capacity is
smaller than the amount of data, I will for sure not be able to transmit
without errors.

So my question is: how large does the channel capacity have to be? For
example, if the channel capacity were C=0.97 bit per channel use, larger
than the source entropy per code bit (0.9591) but smaller than the
apparent entropy of the data after Huffman coding (0.9799), would it be
certain in this case that I have no chance of error-free transmission
(since the channel capacity is smaller than the entropy of the data)? Or
would I have a chance, since it is larger than that of the source?

In the practical case you're right that you will always have a finite
probability of error, but I put the question in the ideal, limit case.
Shannon says that for any epsilon > 0 you will find a code with length
large enough (possibly infinite) so that the remaining error probability
is smaller than epsilon (the typical mathematical setup). Go to the limit
and you come to epsilon -> 0.

While writing this, I think I have discovered my mistake. In order to
reach the ideal case, you don't have to compare the channel capacity with
the entropy, but with the number of bits (the data rate). And this would
be 2 bit/symbol for both Huffman codes!

However, more comments are welcome, especially on the fact that the two
Huffman codes are not equivalent, although they have the same average
codeword length.

Best regards,
Franz

---------------------------------------
Posted through http://www.DSPRelated.com
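
To make that bit-rate comparison concrete, a small illustrative Python
sketch (C = 0.97 bit per channel use is just the hypothetical value
mentioned above, and the Huffman output is treated as arbitrary data):

  H_source = 1.918          # information per source symbol, bit
  L_avg    = 2.0            # Huffman code bits per source symbol (both codes)
  C        = 0.97           # hypothetical capacity, bit per channel use
  print(L_avg / C)          # ~2.06 channel uses/symbol if the raw Huffman bits are sent
  print(H_source / C)       # ~1.98 channel uses/symbol as the theoretical minimum
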
On Friday, July 3, 2015 at 12:51:53 PM UTC-5, Tim Wescott wrote:

> Shannon's "you'll be able to find a code" is very much in the
> mathematical sense: the channel capacity theorem says that there is a
> _possibility_ that there's a code out there that'll do the job.  It
> doesn't say that there's a _certainty_ that there's a code out there
> that'll do the job, it doesn't say that there's even a likelihood that
> humans will be able to find it before we extinct ourselves -- it just
> puts a bound on any candidate code that you might use for the purpose.

Oh, please, Tim!  That is a completely nonsensical statement, and as your
academic grandfather (Per Enge's PhD advisor), I am embarrassed that you
have opted to make it in public -- it reflects badly on Per (and on me).

Shannon calculated the _average_ error rate (averaged over all possible
codes of a given rate R and block length n) and showed if R was smaller
than a certain number that he called the capacity C, then the _average_
error rate could be made as small as one liked by increasing n
sufficiently.  (Conversely, if R > C, then the average (bit) error rate
converged to 1/2 as n increased.)  Since there must be codes in the
ensemble of codes of rate R and (large) block length n which have error
rate at least as small as the _average_, Shannon's theorems **guarantee**
the existence of good codes (meaning with small error rate) of code rate
R < C and of long block length n.  It is incorrect to assert didactically,
as Tim Wescott does, that

"It doesn't say that there's a _certainty_ that there's a code out there
that'll do the job"

Instead Shannon's work shows that it is _certain_ that there is a code
out there that will do the job, but Shannon provided no guidance about
how one might go about finding such a code, and brute force search for a
good code is prohibitively time-consuming when n is large.  Even more
practically, it is not just that we need to find a good code, but we are
looking for an even rarer bird: a good code whose decoder is relatively
easy to implement.  **That** is a much harder search job.

Dilip Sarwate
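
In symbols, that averaging argument runs roughly as follows (a sketch in
the random-coding-exponent form due to Gallager rather than Shannon's
exact 1948 statement; E_r(R) is the random-coding error exponent and the
expectation is over codes of rate R and block length n drawn at random):

  \bar{P}_e = \mathbb{E}_{\mathcal{C}}\left[ P_e(\mathcal{C}) \right]
            \le 2^{-n E_r(R)}, \qquad E_r(R) > 0 \text{ for all } R < C,

  \exists\, \mathcal{C}^{*}: \; P_e(\mathcal{C}^{*}) \le \bar{P}_e
            \to 0 \quad \text{as } n \to \infty .
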
On 07/04/2015 10:36 AM, dvsarwate wrote:
> Since there must be codes in the ensemble of codes of rate R and (large)
> block length n which have error rate at least as small as the _average_,
> Shannon's theorems **guarantee** the existence of good codes (meaning
> with small error rate) of code rate R < C and of long block length n. [...]

I think Tim may be confusing "a code exists" with "a practical code
exists". With sufficient processing power available, the practicality
comes down to how much latency you can tolerate. What is interesting is
that people are now finding codes, like polar codes, which can get
damned close to Shannon without massive latency.

Steve
On Fri, 03 Jul 2015 19:36:06 -0700, dvsarwate wrote:

> Oh, please, Tim!  That is a completely nonsensical statement, and as
> your academic grandfather (Per Enge's PhD advisor), I am embarrassed
> that you have opted to make it in public -- it reflects badly on Per
> (and on me). [...]

Hmm.  I did say I wasn't a super-expert.  I'll go read Shannon's paper on
the subject (79 pages -- egad!).

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On Sat, 04 Jul 2015 12:04:10 +0800, Steve Underwood wrote:

> I think Tim may be confusing "a code exists" with "a practical code
> exists". [...]

No, Tim was forgetting what he'd learned.  Dammit.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On July 3, Steve Underwood wrote:
> What is interesting is that people are now finding codes, like polar
> codes, which can get damned close to Shannon without massive latency.

This mystifies me.  I recall learning that a Shannon-optimal code decoder
is exponentially complex in the block length.  How does a polar code evade
this, and what tricks have been invented recently?

-- Rich
On July 3, 2015 dvsarwate wrote:
> Instead Shannon's work shows that it is _certain_ that there is a code
> out there that will do the job, but Shannon provided no guidance about
> how one might go about finding such a code [...]

Another crucial question occurs to me.  Shannon's model is a simplex
channel with additive noise.  But in an urban wireless environment we're
dealing with a multi-path channel, i.e. reflections.  This has been
extensively studied, but I wonder: do Shannon's model and theorems still
apply?  Or is it all archaic, replaced by newer theory?

-- Rich
Steve Underwood  <steveu@dis.org> wrote:

>I think Tim may be confusing "a code exists" with "a practical code
>exists". With sufficient processing power available, the practicality
>comes down to how much latency you can tolerate. What is interesting is
>that people are now finding codes, like polar codes, which can get
>damned close to Shannon without massive latency.

Is there a body of research that says polar codes are as useful as, or
more useful than, other, more traditional near-capacity codes such as
turbo and LDPC codes?

I ask because these traditional codes can often get to within 0.1 to 0.2
dB of capacity anyway, for a wide range of block lengths and code rates.

Steve