
Data compression that works down to DC

Started by Tim Wescott April 3, 2014
So I'm working on a product that logs a bunch of data to flash.

Due to physical space constraints (the board is something like 35 x 
45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING 
that's available in uncompressed form, I'd be challenging the amount of 
space available.

The data is mostly captured analog data, and it's often in the form of 
small-magnitude numbers, or numbers that vary from sample to sample by a 
small magnitude, with the occasional large excursion (like guard duty: 
long stretches of boredom punctuated by occasional bursts of terror).

Is there a worked out standard way of compressing such data, or should I 
just roll my own?  I'd like it to be lossless, or lossy in a way that I 
can fine-tune with parameters.

TIA.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

On Thursday, April 3, 2014 3:18:51 PM UTC-4, Tim Wescott wrote:
> So I'm working on a product that logs a bunch of data to flash.
>
> Due to physical space constraints (the board is something like 35 x
> 45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING
> that's available in uncompressed form, I'd be challenging the amount of
> space available.
>
> The data is mostly captured analog data, and it's often in the form of
> small-magnitude numbers, or numbers that vary from sample to sample by a
> small magnitude, with the occasional large excursion (like guard duty:
> long stretches of boredom punctuated by occasional bursts of terror).
>
> Is there a worked out standard way of compressing such data, or should I
> just roll my own? I'd like it to be lossless, or lossy in a way that I
> can fine-tune with parameters.
>
> TIA.
>
As you know, for lossless data the rate will depend highly on {exact 
nature of data, block length / memory size / real-time delay constraints, 
and overall complexity}. If you want the very best compression rate, the 
best solutions are almost always based on trying a whole bunch of 
different compression schemes, and/or through waveform fitting + encoding 
the residual.

If you want lossy, then it is highly, highly dependent first on the error 
metric(s) you care about. Think about absolute error versus mean square 
error versus percentage error, etc. You should also look at companding 
strategies, either fixed or otherwise. Second, it highly depends on what 
kind of artifacts you can tolerate, and of course this is a related 
question.

If you want to tune, it usually depends on how much complexity you can 
spare. This leads to either block-based schemes that you iteratively tune 
(pragmatically meaning, re-encode), or sample-by-sample schemes.

Now somebody else more daring than me will propose their favorite 
algorithm.

I'm sure I gave the answer you didn't want to hear, but you were 
(understandably) sparse with detail, and you'll have no shortage of 
suggestions to try A or B or C favorite methods ;-).

Julius
Tim Wescott <tim@seemywebsite.really> wrote:

(snip on need for compression)

> The data is mostly captured analog data, and it's often in the form of
> small-magnitude numbers, or numbers that vary from sample to sample by a
> small magnitude, with the occasional large excursion (like guard duty:
> long stretches of boredom punctuated by occasional bursts of terror).
So the first thing to do is generate a differential signal by subtracting each from the previous value. That gets a signal that is mostly small, but sometimes large. Then figure out how to compress that. The optimal answer depends on the statistics of the signal.
> Is there a worked out standard way of compressing such data, or should I
> just roll my own? I'd like it to be lossless, or lossy in a way that I
> can fine-tune with parameters.
There are some commonly used for audio signals. ADPCM is lossy, probably 
too lossy. A-law and mu-law are commonly used for telephone audio (and 
not on differential data) which you might look at.

There are many lossless algorithms that you could use on the differential 
data, such as Huffman (adaptive or not) or an arithmetic coder.

If you know the statistics pretty well, a fixed Huffman code would 
probably do well. You need enough ROM for a look-up table.

-- glen
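A minimal sketch of the differencing step described above, assuming 
16-bit samples held in a RAM buffer (names and buffer handling are 
placeholders, not a worked-out format). Doing the subtraction in unsigned 
arithmetic makes the transform exactly reversible, so nothing is lost:

    #include <stdint.h>
    #include <stddef.h>

    /* Replace each sample with its difference from the previous one.
       Unsigned wraparound keeps the transform exactly invertible. */
    void delta_encode(uint16_t *buf, size_t n)
    {
        uint16_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint16_t cur = buf[i];
            buf[i] = (uint16_t)(cur - prev);   /* mostly small values */
            prev = cur;
        }
    }

    void delta_decode(uint16_t *buf, size_t n)
    {
        uint16_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            prev = (uint16_t)(prev + buf[i]);  /* running sum restores samples */
            buf[i] = prev;
        }
    }

The deltas are then what gets fed to whatever entropy coder is chosen.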
On Thu, 03 Apr 2014 12:29:00 -0700, julius wrote:

> On Thursday, April 3, 2014 3:18:51 PM UTC-4, Tim Wescott wrote:
>> So I'm working on a product that logs a bunch of data to flash.
>>
>> Due to physical space constraints (the board is something like 35 x
>> 45mm), I only have 2MB at my disposal, and if I were to log EVERYTHING
>> that's available in uncompressed form, I'd be challenging the amount of
>> space available.
>>
>> The data is mostly captured analog data, and it's often in the form of
>> small-magnitude numbers, or numbers that vary from sample to sample by
>> a small magnitude, with the occasional large excursion (like guard
>> duty: long stretches of boredom punctuated by occasional bursts of
>> terror).
>>
>> Is there a worked out standard way of compressing such data, or should
>> I just roll my own? I'd like it to be lossless, or lossy in a way that
>> I can fine-tune with parameters.
>>
>> TIA.
>>
> As you know, for lossless data the rate will depend highly on {exact
> nature of data, block length / memory size / real-time delay
> constraints, and overall complexity}. If you want the very best
> compression rate, the best solutions are almost always based on trying a
> whole bunch of different compression schemes, and/or through waveform
> fitting + encoding the residual.
>
> If you want lossy, then it is highly, highly dependent first on the
> error metric(s) you care about. Think about absolute error versus mean
> square error versus percentage error, etc. You should also look at
> companding strategies, either fixed or otherwise. Second, it highly
> depends on what kind of artifacts you can tolerate, and of course this
> is a related question.
>
> If you want to tune, it usually depends on how much complexity you can
> spare. This leads to either block-based schemes that you iteratively
> tune (pragmatically meaning, re-encode), or sample-by-sample schemes.
>
> Now somebody else more daring than me will propose their favorite
> algorithm.
>
> I'm sure I gave the answer you didn't want to hear, but you were
> (understandably) sparse with detail, and you'll have no shortage of
> suggestions to try A or B or C favorite methods ;-).
>
> Julius
It's hard to know just what detail to share, there's so much of it!

I think I could get substantial savings just by tracking when there's 
eight or fewer "full" bits in each word as opposed to 16, and saving 
that. But it'd by no means be optimum.

Another thought I had was to send 8-bit deltas, with the occasional 
vector of 16-bit words either when the rate of change is high, or so you 
didn't have to go back to the beginning of time to reconstruct.

Somehow, I don't think that either method is best.

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
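One possible reading of the 8-bit-delta-with-escape idea above, assuming 
16-bit signed samples and reserving the byte 0x80 as an escape marker 
(the names and the framing are invented for illustration, not a 
worked-out format):

    #include <stdint.h>
    #include <stddef.h>

    /* 8-bit deltas with an escape to a full 16-bit sample.  The byte 0x80
       is reserved as the escape marker, so usable deltas are -127..+127;
       anything bigger is stored as the marker plus two raw bytes.
       Returns the number of bytes written to out (worst case 3*n). */
    #define ESC_BYTE 0x80u

    size_t pack_deltas(const int16_t *x, size_t n, uint8_t *out)
    {
        size_t k = 0;
        int16_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            int32_t d = (int32_t)x[i] - (int32_t)prev;
            if (d >= -127 && d <= 127) {
                out[k++] = (uint8_t)(int8_t)d;     /* small delta: one byte */
            } else {
                uint16_t raw = (uint16_t)x[i];
                out[k++] = ESC_BYTE;               /* escape marker         */
                out[k++] = (uint8_t)(raw & 0xff);  /* raw sample, low byte  */
                out[k++] = (uint8_t)(raw >> 8);    /* raw sample, high byte */
            }
            prev = x[i];
        }
        return k;
    }

A decoder just reverses this: a 0x80 byte means "read the next two bytes 
as a raw sample", anything else is an 8-bit delta added to the previous 
value. Periodically forcing an escaped raw sample would give the 
resynchronization points mentioned above.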
On Thu, 03 Apr 2014 19:53:57 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.really> wrote:
>
> (snip on need for compression)
>
>> The data is mostly captured analog data, and it's often in the form of
>> small-magnitude numbers, or numbers that vary from sample to sample by
>> a small magnitude, with the occasional large excursion (like guard
>> duty: long stretches of boredom punctuated by occasional bursts of
>> terror).
>
> So the first thing to do is generate a differential signal by
> subtracting each from the previous value. That gets a signal that is
> mostly small, but sometimes large. Then figure out how to compress that.
> The optimal answer depends on the statistics of the signal.
>
>> Is there a worked out standard way of compressing such data, or should
>> I just roll my own? I'd like it to be lossless, or lossy in a way that
>> I can fine-tune with parameters.
>
> There are some commonly used for audio signals. ADPCM is lossy, probably
> too lossy. A-law and mu-law are commonly used for telephone audio (and
> not on differential data) which you might look at.
>
> There are many lossless algorithms that you could use on the
> differential data, such as Huffman (adaptive or not) or an arithmetic
> coder.
>
> If you know the statistics pretty well, a fixed Huffman code would
> probably do well. You need enough ROM for a look-up table.
>
> -- glen
ROM I got. I'm not sure that the Huffman would really help, however I'm 
already considering the differential scheme. I should look at some 
downloads, and see what I think of the Huffman.

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Tim Wescott <tim@seemywebsite.really> wrote:

(snip)
>>> The data is mostly captured analog data, and it's often in the form of
>>> small-magnitude numbers, or numbers that vary from sample to sample by
>>> a small magnitude, with the occasional large excursion (like guard
>>> duty: long stretches of boredom punctuated by occasional bursts of
>>> terror).
(snip, then I wrote)
>> If you know the statistics pretty well, a fixed Huffman code would
>> probably do well. You need enough ROM for a look-up table.
> ROM I got. I'm not sure that the Huffman would really help,
> however I'm already considering the differential scheme.
> I should look at some downloads, and see what I think of the Huffman.
Many compression systems work by finding common sequences of data samples 
(bytes in most cases). If samples (in your case, after the difference) 
are uncorrelated, that doesn't help. Well, compression algorithms like 
LZW will still compress uncorrelated but non-uniformly distributed data: 
substrings containing more common letters will occur more often.

But if the samples are uncorrelated, and each coded into an integer 
number of bits, Huffman is a fine choice. If the statistics don't change 
much, you can use static Huffman with a ROM lookup table. That is what 
fax machines do. They code white pixels until the next black, and black 
until the next white, using two different lookup tables. (Most have more 
white than black.)

Otherwise, you can compute the statistics for each data set (you have to 
store it temporarily), generate an appropriate Huffman code, write the 
code out, then the compressed data. This needs more RAM and less ROM, 
needs the table at the beginning of each data set, but is better when the 
statistics can change too much for static Huffman.

The arithmetic coder codes in a non-integer number of bits per sample, 
and so can be more efficient.

Some time ago, I wrote a JBIG2 coder, from the original specification, to 
generate JBIG2 coded PDFs. Decoding is enough slower than LZW that you 
can see the delay in reading pages in Adobe Reader. Well, LZW codes a 
byte at a time, where JBIG2 codes a bit at a time, and with more 
computation for each bit.

-- glen
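For illustration, a toy fixed prefix code in the spirit of the 
static-Huffman-plus-ROM-table approach; the code lengths below are 
invented, and a real table would be generated from measured delta 
statistics (no bounds checking, just a sketch):

    #include <stdint.h>

    /* A tiny fixed prefix code for delta values, in the spirit of a
       static Huffman table burned into ROM.  Code lengths here are made
       up for illustration:
           0        -> 0
           +1 / -1  -> 100 / 101
           +2 / -2  -> 1100 / 1101
           other    -> 111 followed by the 16 raw bits
       This set is prefix-free, so it decodes unambiguously. */

    typedef struct {
        uint8_t *buf;      /* output buffer        */
        uint32_t bitpos;   /* next free bit in buf */
    } bitwriter;

    static void put_bits(bitwriter *w, uint32_t bits, int nbits)
    {
        for (int i = nbits - 1; i >= 0; i--) {      /* MSB first */
            uint32_t byte = w->bitpos >> 3;
            uint32_t mask = 0x80u >> (w->bitpos & 7);
            if ((bits >> i) & 1u) w->buf[byte] |= mask;
            else                  w->buf[byte] &= (uint8_t)~mask;
            w->bitpos++;
        }
    }

    void encode_delta(bitwriter *w, int16_t d)
    {
        switch (d) {
        case  0: put_bits(w, 0x0, 1); break;          /* 0    */
        case  1: put_bits(w, 0x4, 3); break;          /* 100  */
        case -1: put_bits(w, 0x5, 3); break;          /* 101  */
        case  2: put_bits(w, 0xC, 4); break;          /* 1100 */
        case -2: put_bits(w, 0xD, 4); break;          /* 1101 */
        default:                                      /* escape + raw bits */
            put_bits(w, 0x7, 3);
            put_bits(w, (uint16_t)d, 16);
            break;
        }
    }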
On Thursday, April 3, 2014 4:58:59 PM UTC-4, Tim Wescott wrote:

> I think I could get substantial savings just by tracking when there's
> eight or fewer "full" bits in each word as opposed to 16, and saving
> that. But it'd by no means be optimum.
>
> Another thought I had was to send 8-bit deltas, with the occasional
> vector of 16-bit words either when the rate of change is high, or so you
> didn't have to go back to the beginning of time to reconstruct.
>
> Somehow, I don't think that either method is best.
Are you focusing on lossless for now? If you are, a lot of times it's a 
matter of finding the combination of a handful of schemes so that you can 
address the rare but very bad cases that can blow your RT bit budget.

Look up Rice coding, and adaptive Rice coding. And then look up LMS 
prediction-based methods.

Does your data tend to change exponentially or linearly or polynomially? 
This will help guide the kind of dynamic range handling you need, then 
what kind of model-based compression you need.
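A bare-bones sketch of the Rice (Golomb-Rice) coding idea applied to 
deltas; the parameter k is a tuning knob (adaptive Rice coders adjust it 
from recent magnitudes), and the tiny bit writer is repeated here only so 
the sketch stands alone:

    #include <stdint.h>

    /* Rice coding of a signed delta with parameter k.  Signed values are
       zigzag-mapped to unsigned (0,-1,+1,-2,+2,... -> 0,1,2,3,4,...),
       then split into quotient q = u >> k (sent in unary: q ones then a
       zero) and the low k bits (sent verbatim). */

    typedef struct { uint8_t *buf; uint32_t bitpos; } bitwriter;

    static void put_bit(bitwriter *w, int b)
    {
        uint32_t byte = w->bitpos >> 3, mask = 0x80u >> (w->bitpos & 7);
        if (b) w->buf[byte] |= mask; else w->buf[byte] &= (uint8_t)~mask;
        w->bitpos++;
    }

    void rice_encode(bitwriter *w, int16_t d, unsigned k)
    {
        uint32_t u = (d >= 0) ? ((uint32_t)d << 1)
                              : (((uint32_t)(-(int32_t)d) << 1) - 1);
        uint32_t q = u >> k;
        while (q--) put_bit(w, 1);            /* unary quotient  */
        put_bit(w, 0);                        /* terminator      */
        for (int i = (int)k - 1; i >= 0; i--) /* k-bit remainder */
            put_bit(w, (u >> i) & 1u);
    }

Small deltas cost only a few bits, while the occasional large excursion 
still gets through at a bounded (if longer) cost, which is the appeal for 
"boredom punctuated by terror" data.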
If you are happy with a 50% saving (which is what 8-bit storing would
give), why not try LZW?

On 04.04.2014 02:19, julius wrote:
> On Thursday, April 3, 2014 4:58:59 PM UTC-4, Tim Wescott wrote:
>
>> I think I could get substantial savings just by tracking when there's
>> eight or fewer "full" bits in each word as opposed to 16, and saving
>> that. But it'd by no means be optimum.
>>
>> Another thought I had was to send 8-bit deltas, with the occasional
>> vector of 16-bit words either when the rate of change is high, or so
>> you didn't have to go back to the beginning of time to reconstruct.
>>
>> Somehow, I don't think that either method is best.
>
> Are you focusing on lossless for now? If you are, a lot of times it's a
> matter of finding the combination of a handful of schemes so that you
> can address the rare but very bad cases that can blow your RT bit
> budget.
>
> Look up Rice coding, and adaptive Rice coding. And then look up LMS
> prediction-based methods.
>
> Does your data tend to change exponentially or linearly or
> polynomially? This will help guide the kind of dynamic range handling
> you need, then what kind of model-based compression you need.
On 4/3/14 5:00 PM, Tim Wescott wrote:
> On Thu, 03 Apr 2014 19:53:57 +0000, glen herrmannsfeldt wrote:
>
>> Tim Wescott<tim@seemywebsite.really> wrote:
>>
>> (snip on need for compression)
>>
>>> The data is mostly captured analog data, and it's often in the form of
>>> small-magnitude numbers, or numbers that vary from sample to sample by
>>> a small magnitude, with the occasional large excursion (like guard
>>> duty: long stretches of boredom punctuated by occasional bursts of
>>> terror).
>>
>> So the first thing to do is generate a differential signal by
>> subtracting each from the previous value. That gets a signal that is
>> mostly small, but sometimes large. Then figure out how to compress that.
>> The optimal answer depends on the statistics of the signal.
>>
...
> I'm not sure that the Huffman would really help, however I'm
> already considering the differential scheme. I should look at some
> downloads, and see what I think of the Huffman.
>
Tim and Glen and Julius et al.,

if you don't mind the bit twiddling that you do with Huffman, simply 
first do the differential encoding (and you can generalize that with LPC, 
so you use LPC to *guess* at the next sample value and you encode the 
difference between your guess and what the sample value really is) and 
then do Huffman coding on the result. many of the difference samples will 
be zero or +1 or -1 or +2 or -2, etc. you will not need 8 bits to encode 
them. and going from 8-bit differences (like -128 to +127) to something 
greater than 8 bits is not the same as doubling your word size, if you 
were to encode this with Huffman. it wouldn't matter.

so you would need an optimal (or, at least, decent) set of LPC 
coefficients (which depend on the spectrum of the data) and a code book 
for the Huffman decoding (which will depend on the PDF or 
frequency-of-occurrence of the difference data). if the LPC is done well, 
the difference signal should be white and small in amplitude and will 
likely be gaussian in PDF.

i remember talking with a guy named Claude Cellier, who had this company 
called Pyramid Technologies or something, about lossless coding of audio. 
dunno how the Apple lossless works, but the only way i can think of to 
losslessly compress a signal is to take advantage of its non-whiteness 
(that's what LPC differencing does) and take advantage of its non-uniform 
PDF (that's what Huffman coding does).

if the data is white and uniform PDF, there ain't nothing you can do to 
losslessly compress it. it's just like the data came outa a real good 
random number generator and there ain't shit you can do about it (but, 
back around 2002, there was a company called ZeoSync that claimed they 
could: 
http://news.cnet.com/The-quest-for-near-perfect-compression/2100-1023_3-839851.html 
and you can see that i wasn't so impressed - now the domain name is for 
sale).

-- 

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
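One cheap special case of the predict-and-encode-the-residual idea 
described above, assuming 16-bit samples: a fixed second-order predictor 
(coefficients 2 and -1, i.e. straight-line extrapolation) instead of 
fitted LPC coefficients. The residuals would then go to the Huffman or 
Rice coder:

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed 2nd-order predictor: extrapolate a straight line through the
       last two samples and keep the prediction error.  Real LPC would
       fit the coefficients to the spectrum of the data; here they are
       fixed at (2, -1). */
    void predict_residual(const int16_t *x, size_t n, int32_t *res)
    {
        for (size_t i = 0; i < n; i++) {
            int32_t p;
            if      (i == 0) p = 0;                          /* no history yet  */
            else if (i == 1) p = x[0];                       /* 1st-order guess */
            else             p = 2 * (int32_t)x[i-1] - x[i-2]; /* linear extrap. */
            res[i] = (int32_t)x[i] - p;   /* small if the guess was good */
        }
    }

Since the predictor only uses already-decoded samples, the decoder can 
form the same guesses and add the residuals back, so the scheme stays 
lossless.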
robert bristow-johnson <rbj@audioimagination.com> wrote:
> On 4/3/14 5:00 PM, Tim Wescott wrote:
(snip)
>> I'm not sure that the Huffman would really help, however I'm
>> already considering the differential scheme. I should look at some
>> downloads, and see what I think of the Huffman.
(snip)
> if you don't mind the bit twiddling that you do with Huffman, simply
> first do the differential encoding (and you can generalize that with
> LPC, so you use LPC to *guess* at the next sample value and you encode
> the difference between your guess and what the sample value really is)
> and then do Huffman coding on the result. many of the difference samples
> will be zero or +1 or -1 or +2 or -2, etc. you will not need 8 bits to
> encode them. and going from 8-bit differences (like -128 to +127) to
> something greater than 8 bits is not the same as doubling your word
> size, if you were to encode this with Huffman. it wouldn't matter.
(snip)
> if the data is white and uniform PDF, there ain't nothing you can do to
> losslessly compress it. it's just like the data came outa a real good
> random number generator and there ain't shit you can do about it
There is one more requirement: that the data fill up the word. A file of 
random, uniform-PDF ASCII letters from A to Z is somewhat compressible, 
by a factor of log(26)/log(256) (if stored in 8-bit bytes).

LZW does reasonably well at doing that, as repeated substrings are 
proportionally more probable. Huffman will code each letter in 4 or 5 
bits, so again, not so far off. Coding each letter with the appropriate 
fractional number of bits would do slightly better.

-- glen
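Back-of-envelope numbers for that A-to-Z example (my arithmetic, shown 
only to make the ratios concrete):

    #include <stdio.h>
    #include <math.h>

    /* 26 equally likely letters, stored one per 8-bit byte. */
    int main(void)
    {
        double entropy = log2(26.0);        /* ~4.70 bits per letter       */
        double ratio   = entropy / 8.0;     /* = log(26)/log(256) ~ 0.59   */
        /* Huffman on 26 equiprobable symbols: 6 codes of 4 bits and
           20 codes of 5 bits -> average (6*4 + 20*5)/26 ~ 4.77 bits.     */
        double huffman = (6.0 * 4 + 20.0 * 5) / 26.0;
        printf("entropy  %.2f bits/letter\n", entropy);
        printf("ideal    %.3f of original size\n", ratio);
        printf("huffman  %.2f bits/letter (%.3f of original)\n",
               huffman, huffman / 8.0);
        return 0;
    }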