Reply by glen herrmannsfeldt●April 6, 2014
robert bristow-johnson <rbj@audioimagination.com> wrote:
> On 4/3/14 5:00 PM, Tim Wescott wrote:
(snip)
>> I'm not sure that the Huffman would really help, however I'm
>> already considering the differential scheme. I should look at some
>> downloads, and see what I think of the Huffman.
(snip)
> if you don't mind bit twiddling that you do with Huffman, simply first
> do the differential encoding (and you can generalize that with LPC, so
> you use LPC to *guess* at the next sample value and you encode the
> difference between your guess and what the sample value really is) and
> then do Huffman coding on the result. many of the difference samples
> will be zero or +1 or -1 or +2 or -2, etc. you will not need 8 bits to
> encode them. and the difference between going from 8-bit differences
> (like -128 and +127) to greater than 8 bits is not the difference of
> doubling your word size, if you were to encode this with Huffman. it
> wouldn't matter.
(snip)
> if the data is white and uniform PDF, there ain't nothing you can do to
> losslessly compress it. it's just like the data came outa a real good
> random number generator and there ain't shit you can do about it
There is one more requirement, that the data fill up the word.
A file of random, uniform-PDF ASCII letters from A to Z is
somewhat compressible, by a factor of log(26)/log(256)
(if stored in 8-bit bytes).
LZW does reasonably well at that, as repeated substrings
are proportionally more probable.
Huffman will code each letter in 4 or 5 bits, so again, not so
far off. Coding each letter with the appropriate fractional
number of bits would do slightly better.
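The factor quoted above can be checked numerically (a quick sketch; the 26-letter alphabet and 8-bit storage are as stated in the post):

```python
import math

# Each A-Z letter carries log2(26) bits of information but occupies
# an 8-bit byte, so the best possible lossless compression ratio is
# log(26)/log(256), as stated above.
bits_per_letter = math.log2(26)        # information content per letter
ratio = math.log(26) / math.log(256)   # best achievable size ratio

print(f"{bits_per_letter:.2f} bits/letter, ratio {ratio:.4f}")
```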
-- glen
Reply by robert bristow-johnson●April 5, 2014
On 4/3/14 5:00 PM, Tim Wescott wrote:
> On Thu, 03 Apr 2014 19:53:57 +0000, glen herrmannsfeldt wrote:
>
>> Tim Wescott<tim@seemywebsite.really> wrote:
>>
>> (snip on need for compression)
>>
>>> The data is mostly captured analog data, and it's often in the form of
>>> small-magnitude numbers, or numbers that vary from sample to sample by
>>> a small magnitude, with the occasional large excursion (like guard
>>> duty: long stretches of boredom punctuated by occasional bursts of
>>> terror).
>>
>> So the first thing to do is generate a differential signal by
>> subtracting each from the previous value. That gets a signal that is
>> mostly small, but sometimes large. Then figure out how to compress that.
>> The optimal answer depends on the statistics of the signal.
>>
...
> I'm not sure that the Huffman would really help, however I'm
> already considering the differential scheme. I should look at some
> downloads, and see what I think of the Huffman.
>
Tim and Glen and Julius et al.
if you don't mind bit twiddling that you do with Huffman, simply first
do the differential encoding (and you can generalize that with LPC, so
you use LPC to *guess* at the next sample value and you encode the
difference between your guess and what the sample value really is) and
then do Huffman coding on the result. many of the difference samples
will be zero or +1 or -1 or +2 or -2, etc. you will not need 8 bits to
encode them. and the difference between going from 8-bit differences
(like -128 and +127) to greater than 8 bits is not the difference of
doubling your word size, if you were to encode this with Huffman. it
wouldn't matter.
so you would need an optimal (or, at least, decent) set of LPC
coefficients (which depend on the spectrum of the data) and a code book
for the Huffman decoding (which will depend on the PDF or
frequency-of-occurrence of the difference data). if the LPC is done
well, the difference signal should be white and small in amplitude and
will likely be gaussian in PDF.
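A minimal sketch of the predict-and-encode-the-difference step (plain first-order differencing standing in for a tuned LPC predictor; the default coefficient set here is an illustrative choice, not an optimized one):

```python
def encode_residuals(samples, coeffs=(1.0,)):
    """Predict each sample from the previous ones with fixed LPC-style
    coefficients and emit the prediction residual.  coeffs=(1.0,)
    reduces to plain first-difference encoding."""
    history = [0] * len(coeffs)                # most-recent sample first
    residuals = []
    for x in samples:
        pred = int(round(sum(c * h for c, h in zip(coeffs, history))))
        residuals.append(x - pred)
        history = [x] + history[:-1]
    return residuals

def decode_residuals(residuals, coeffs=(1.0,)):
    """Exact inverse: rebuild each sample from prediction + residual."""
    history = [0] * len(coeffs)
    samples = []
    for r in residuals:
        pred = int(round(sum(c * h for c, h in zip(coeffs, history))))
        x = pred + r
        samples.append(x)
        history = [x] + history[:-1]
    return samples

data = [100, 101, 101, 103, 102, 180, 181, 182]
res = encode_residuals(data)       # mostly small values, one excursion
assert decode_residuals(res) == data
```

With a well-chosen predictor, the residuals cluster near zero, which is exactly the peaked distribution the Huffman stage then exploits.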
i remember talking with a guy named Claude Cellier who had this company
called Pyramid Technologies or something, about lossless coding of
audio. dunno how the Apple lossless works, but the only way i can think
of to losslessly compress a signal is take advantage of its
non-whiteness (that's what LPC differencing does) and take advantage of
its non-uniform PDF (that's what Huffman coding does).
if the data is white and uniform PDF, there ain't nothing you can do to
losslessly compress it. it's just like the data came outa a real good
random number generator and there ain't shit you can do about it (but,
back around 2002, there was a company called ZeoSync that claimed they
could:
http://news.cnet.com/The-quest-for-near-perfect-compression/2100-1023_3-839851.html
and you can see that i wasn't so impressed - now the domain name is for
sale).
--
r b-j rbj@audioimagination.com
"Imagination is more important than knowledge."
Reply by Andre●April 4, 2014
If you are happy with a 50% saving (which is what 8-bit storing would
give), why not try LZW?
On 04.04.2014 02:19, julius wrote:
> On Thursday, April 3, 2014 4:58:59 PM UTC-4, Tim Wescott wrote:
>
>> I think I could get substantial savings just by tracking when there's
>> eight or fewer "full" bits in each word as opposed to 16, and saving
>> that. But it'd by no means be optimum.
>>
>> Another thought I had was to send 8-bit deltas, with the occasional
>> vector of 16-bit words either when the rate of change is high, or so you
>> didn't have to go back to the beginning of time to reconstruct.
>>
>> Somehow, I don't think that either method is best.
>
> Are you focusing on lossless for now? If you are, a lot of times it's a
> matter of finding the combination of a handful of schemes so that you can
> address the rare but very bad cases that can blow your RT bit budget.
>
> Look up Rice coding, and adaptive Rice coding. And then look up LMS
> prediction-based methods.
>
> Does your data tend to change exponentially, linearly, or polynomially?
> This will help guide the kind of dynamic range handling you need, and
> then what kind of model-based compression you need.
>
Reply by julius●April 3, 2014
On Thursday, April 3, 2014 4:58:59 PM UTC-4, Tim Wescott wrote:
> I think I could get substantial savings just by tracking when there's
> eight or fewer "full" bits in each word as opposed to 16, and saving
> that. But it'd by no means be optimum.
>
> Another thought I had was to send 8-bit deltas, with the occasional
> vector of 16-bit words either when the rate of change is high, or so you
> didn't have to go back to the beginning of time to reconstruct.
>
> Somehow, I don't think that either method is best.
Are you focusing on lossless for now? If you are, a lot of times it's a matter of finding the combination of a handful of schemes so that you can address the rare but very bad cases that can blow your RT bit budget.
Look up Rice coding, and adaptive Rice coding. And then look up LMS prediction-based methods.
Does your data tend to change exponentially, linearly, or polynomially? This will help guide the kind of dynamic range handling you need, and then what kind of model-based compression you need.
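A bare-bones Rice coder for the kind of signed deltas discussed here might look like the following (the zigzag sign mapping, the choice k = 2, and the bit-string representation are illustrative assumptions, not part of the original suggestion):

```python
def rice_encode(values, k):
    """Rice-code signed integers with parameter k.  Returns a bit string
    for clarity; a real coder would pack the bits into bytes."""
    bits = []
    for v in values:
        u = (v << 1) if v >= 0 else (-v << 1) - 1   # zigzag: fold sign in
        q, r = u >> k, u & ((1 << k) - 1)           # quotient, remainder
        bits.append('1' * q + '0' + (format(r, f'0{k}b') if k else ''))
    return ''.join(bits)

def rice_decode(bitstr, k, count):
    """Inverse of rice_encode for `count` values."""
    values, i = [], 0
    for _ in range(count):
        q = 0
        while bitstr[i] == '1':                     # unary quotient
            q, i = q + 1, i + 1
        i += 1                                      # skip terminating '0'
        r = int(bitstr[i:i + k], 2) if k else 0
        i += k
        u = (q << k) | r
        values.append(u >> 1 if u % 2 == 0 else -((u + 1) >> 1))
    return values

deltas = [0, 1, -1, 2, 0, -3, 15]
coded = rice_encode(deltas, k=2)
assert rice_decode(coded, 2, len(deltas)) == deltas
```

Adaptive variants re-estimate k from recent magnitudes, which is how they cope with the "bursts of terror" without blowing the bit budget.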
Reply by glen herrmannsfeldt●April 3, 2014
Tim Wescott <tim@seemywebsite.really> wrote:
(snip)
>>> The data is mostly captured analog data, and it's often in the form of
>>> small-magnitude numbers, or numbers that vary from sample to sample by
>>> a small magnitude, with the occasional large excursion (like guard
>>> duty: long stretches of boredom punctuated by occasional bursts of
>>> terror).
(snip, then I wrote)
>> If you know the statistics pretty well, a fixed Huffman code would
>> probably do well. You need enough ROM for a look-up table.
> ROM I got. I'm not sure that the Huffman would really help,
> however I'm already considering the differential scheme.
> I should look at some downloads, and see what I think of the Huffman.
Many compression systems work by finding common sequences of data
samples (bytes in most cases). If samples (in your case, after the
difference) are uncorrelated, that doesn't help.
Well, compression algorithms like LZW will still compress
uncorrelated but non-uniformly distributed data. Substrings
containing more common letters will occur more often.
But if the samples are uncorrelated, and each coded into an
integer number of bits, Huffman is a fine choice. If the statistics
don't change much, you can use static Huffman with a ROM lookup
table. That is what fax machines do. They code white pixels until
the next black, and black until the next white, using two different
lookup tables. (Most have more white than black.)
Otherwise, you can compute the statistics for each data set
(you have to store it temporarily), generate an appropriate
Huffman code, write the code out, then the compressed data.
This needs more RAM and less ROM, and requires storing the table
at the beginning of each data set, but it is better when the
statistics can change too much for static Huffman.
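The compute-the-statistics-then-generate-a-code step can be sketched in a few lines (a minimal Huffman table builder; the sample deltas are made up for illustration):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code from observed symbol frequencies: the
    'compute statistics for each data set, generate a code' step."""
    freq = Counter(data)
    # Heap entries: (weight, unique tiebreak, {symbol: code-so-far}).
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol input
        return {s: '0' for s in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # merge two lightest subtrees,
        w2, _, t2 = heapq.heappop(heap)     # prepending a branch bit
        merged = {s: '0' + c for s, c in t1.items()}
        merged.update({s: '1' + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

deltas = [0, 0, 0, 1, 1, -1, -1, 2, -7]
table = huffman_code(deltas)
encoded = ''.join(table[d] for d in deltas)
assert len(encoded) < len(deltas) * 8       # beats fixed 8-bit coding
```

The resulting table would be burned into ROM for the static case, or emitted ahead of each block for the adaptive case described above.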
The arithmetic coder codes in a non-integer number of bits per
sample, and so can be more efficient. Some time ago, I wrote
a JBIG2 coder, from the original specification, to generate
JBIG2-coded PDFs. Decoding is enough slower than LZW that you
can see the delay in reading pages in Adobe Reader. Then again,
LZW codes a byte at a time, whereas JBIG2 codes a bit at a time,
and with more computation for each bit.
-- glen
Reply by Tim Wescott●April 3, 2014
On Thu, 03 Apr 2014 19:53:57 +0000, glen herrmannsfeldt wrote:
> Tim Wescott <tim@seemywebsite.really> wrote:
>
> (snip on need for compression)
>
>> The data is mostly captured analog data, and it's often in the form of
>> small-magnitude numbers, or numbers that vary from sample to sample by
>> a small magnitude, with the occasional large excursion (like guard
>> duty: long stretches of boredom punctuated by occasional bursts of
>> terror).
>
> So the first thing to do is generate a differential signal by
> subtracting each from the previous value. That gets a signal that is
> mostly small, but sometimes large. Then figure out how to compress that.
> The optimal answer depends on the statistics of the signal.
>
>> Is there a worked out standard way of compressing such data, or should
>> I just roll my own? I'd like it to be lossless, or lossy in a way that
>> I can fine-tune with parameters.
>
> There are some commonly used for audio signals. ADPCM is lossy, probably
> too lossy. A-law and mu-law are commonly used for telephone audio (and
> not on differential data) which you might look at.
>
> There are many lossless algorithms that you could use on the
> differential data, such as Huffman (adaptive or not) or an arithmetic
> coder.
>
> If you know the statistics pretty well, a fixed Huffman code would
> probably do well. You need enough ROM for a look-up table.
>
> -- glen
ROM I got. I'm not sure that the Huffman would really help, however I'm
already considering the differential scheme. I should look at some
downloads, and see what I think of the Huffman.
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by Tim Wescott●April 3, 2014
On Thu, 03 Apr 2014 12:29:00 -0700, julius wrote:
> On Thursday, April 3, 2014 3:18:51 PM UTC-4, Tim Wescott wrote:
>> So I'm working on a product that logs a bunch of data to flash.
>>
>> Due to physical space constraints (the board is something like 35 x
>> 45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING
>> that's available in uncompressed form, I'd be challenging the amount of
>> space available.
>>
>> The data is mostly captured analog data, and it's often in the form of
>> small-magnitude numbers, or numbers that vary from sample to sample by
>> a small magnitude, with the occasional large excursion (like guard
>> duty: long stretches of boredom punctuated by occasional bursts of
>> terror).
>>
>> Is there a worked out standard way of compressing such data, or should
>> I just roll my own? I'd like it to be lossless, or lossy in a way that
>> I can fine-tune with parameters.
>>
>> TIA.
>>
>>
> As you know, for lossless data the rate will depend highly on {exact
> nature of data, block length / memory size / real-time delay
> constraints, and overall complexity}. If you want the very best
> compression rate, the best solutions are almost always based on trying a
> whole bunch of different compression schemes, and/or on waveform
> fitting + encoding the residual.
>
> If you want lossy then it highly depends, first, on the error metric(s)
> you care about. Think about absolute error versus mean square error
> versus percentage error, etc. You should also look at companding
> strategies, either fixed or otherwise. Second, it highly depends on
> what kind of artifacts you can tolerate, and of course this is a
> related question.
>
> If you want to tune, it usually depends on how much complexity you can
> spare. This leads to either block-based schemes that you iteratively
> tune (pragmatically meaning, re-encode), or sample-by-sample.
>
> Now somebody else more daring than me will propose their favorite
> algorithm.
>
> I'm sure I gave the answer you didn't want to hear, but you were
> (understandably) sparse with detail and you'll have no shortage of
> suggestions to try A or B or C favorite methods ;-).
>
> Julius
It's hard to know just what detail to share, there's so much of it!
I think I could get substantial savings just by tracking when there's
eight or fewer "full" bits in each word as opposed to 16, and saving
that. But it'd by no means be optimum.
Another thought I had was to send 8-bit deltas, with the occasional
vector of 16-bit words either when the rate of change is high, or so you
didn't have to go back to the beginning of time to reconstruct.
Somehow, I don't think that either method is best.
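The 8-bit-deltas-with-occasional-16-bit-words idea can be sketched as an escape-coded stream (this particular layout, with -128 reserved as the escape byte, is an illustrative assumption rather than a worked-out proposal from the thread):

```python
def pack_words(words):
    """Deltas in [-127, 127] go out as one signed byte; anything else
    is the escape byte -128 followed by the full signed 16-bit value."""
    out = bytearray()
    for w in words:
        if -127 <= w <= 127:
            out += w.to_bytes(1, 'big', signed=True)
        else:
            out += (-128).to_bytes(1, 'big', signed=True)   # escape marker
            out += w.to_bytes(2, 'big', signed=True)
    return bytes(out)

def unpack_words(buf):
    """Inverse of pack_words: expand escaped 16-bit values."""
    words, i = [], 0
    while i < len(buf):
        b = int.from_bytes(buf[i:i + 1], 'big', signed=True)
        if b == -128:
            words.append(int.from_bytes(buf[i + 1:i + 3], 'big', signed=True))
            i += 3
        else:
            words.append(b)
            i += 1
    return words

deltas = [3, -2, 0, 500, -1, -32768, 7]
packed = pack_words(deltas)
assert unpack_words(packed) == deltas
```

The worst case (every delta large) costs 3 bytes per word instead of 2, which is the kind of rare-but-bad blowup the other posts warn about.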
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Reply by glen herrmannsfeldt●April 3, 2014
Tim Wescott <tim@seemywebsite.really> wrote:
(snip on need for compression)
> The data is mostly captured analog data, and it's often in the form of
> small-magnitude numbers, or numbers that vary from sample to sample by a
> small magnitude, with the occasional large excursion (like guard duty:
> long stretches of boredom punctuated by occasional bursts of terror).
So the first thing to do is generate a differential signal by
subtracting each from the previous value. That gets a signal that
is mostly small, but sometimes large. Then figure out how to
compress that. The optimal answer depends on the statistics
of the signal.
> Is there a worked out standard way of compressing such data, or should I
> just roll my own? I'd like it to be lossless, or lossy in a way that I
> can fine-tune with parameters.
There are some commonly used for audio signals. ADPCM is lossy,
probably too lossy. A-law and mu-law are commonly used for telephone
audio (and not on differential data) which you might look at.
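The mu-law curve mentioned here is simple to state in its continuous form (this is the analytic mu = 255 formula; actual G.711 telephone coding additionally quantizes the result to 8 bits, which this sketch does not do):

```python
import math

MU = 255.0   # standard mu constant for 8-bit telephone audio

def mulaw_compress(x):
    """Map a sample in [-1, 1] through the logarithmic mu-law curve,
    expanding small amplitudes and compressing large ones."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Exact inverse of the continuous mu-law curve."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.01                         # small samples keep fine resolution
y = mulaw_compress(x)
assert abs(mulaw_expand(y) - x) < 1e-12
```

Note this is lossy once quantized: small samples keep fine resolution at the expense of coarse steps near full scale.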
There are many lossless algorithms that you could use on the
differential data, such as Huffman (adaptive or not) or an
arithmetic coder.
If you know the statistics pretty well, a fixed Huffman code
would probably do well. You need enough ROM for a look-up table.
-- glen
Reply by julius●April 3, 2014
On Thursday, April 3, 2014 3:18:51 PM UTC-4, Tim Wescott wrote:
> So I'm working on a product that logs a bunch of data to flash.
>
> Due to physical space constraints (the board is something like 35 x
> 45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING
> that's available in uncompressed form, I'd be challenging the amount of
> space available.
>
> The data is mostly captured analog data, and it's often in the form of
> small-magnitude numbers, or numbers that vary from sample to sample by a
> small magnitude, with the occasional large excursion (like guard duty:
> long stretches of boredom punctuated by occasional bursts of terror).
>
> Is there a worked out standard way of compressing such data, or should I
> just roll my own? I'd like it to be lossless, or lossy in a way that I
> can fine-tune with parameters.
>
> TIA.
>
As you know, for lossless data the rate will depend highly on {exact nature of data, block length / memory size / real-time delay constraints, and overall complexity}. If you want the very best compression rate, the best solutions are almost always based on trying a whole bunch of different compression schemes, and/or on waveform fitting + encoding the residual.
If you want lossy then it highly depends, first, on the error metric(s) you care about. Think about absolute error versus mean square error versus percentage error, etc. You should also look at companding strategies, either fixed or otherwise. Second, it highly depends on what kind of artifacts you can tolerate, and of course this is a related question.
If you want to tune, it usually depends on how much complexity you can spare. This leads to either block-based schemes that you iteratively tune (pragmatically meaning, re-encode), or sample-by-sample.
Now somebody else more daring than me will propose their favorite algorithm.
I'm sure I gave the answer you didn't want to hear, but you were (understandably) sparse with detail and you'll have no shortage of suggestions to try A or B or C favorite methods ;-).
Julius
Reply by Tim Wescott●April 3, 2014
So I'm working on a product that logs a bunch of data to flash.
Due to physical space constraints (the board is something like 35 x
45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING
that's available in uncompressed form, I'd be challenging the amount of
space available.
The data is mostly captured analog data, and it's often in the form of
small-magnitude numbers, or numbers that vary from sample to sample by a
small magnitude, with the occasional large excursion (like guard duty:
long stretches of boredom punctuated by occasional bursts of terror).
Is there a worked out standard way of compressing such data, or should I
just roll my own? I'd like it to be lossless, or lossy in a way that I
can fine-tune with parameters.
TIA.
--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com