DSPRelated.com
Forums

How do I compress an array of floating numbers in Matlab?

Started by Luna Moon April 2, 2010
On Apr 3, 8:29 pm, "Mark Shore" <msh...@magmageosciences.ca> wrote:
> Luna Moon <lunamoonm...@gmail.com> wrote in message <205a603e-cc38-4088-8d39-5d5b8464a...@d34g2000vbl.googlegroups.com>...
> > Hi all,
> >
> > I have a vector of real numbers in Matlab. How do I compress them? Of
> > course this has to be lossless, since I need to be able to recover
> > them.
> >
> > The goal is to study the Shannon rate and entropy of these real
> > numbers, so I decided to compress them and see how much compression
> > ratio I can get.
> >
> > I don't need to write the result into compressed files, so those
> > headers, etc. are just overhead for me which affects my calculation of
> > the entropy... so I just need a bare version of the compression ratio...
> >
> > Any pointers?
> >
> > Thanks a lot!
>
> An exceedingly simple test involving little or no effort on your part would be to take representative binary files and compress them with off-the-shelf utilities such as WinZip or 7-Zip.
>
> This would certainly give you some idea of what level of lossless compression you can expect from reasonably well-tested and mature algorithms before you try to adapt your own.
Thanks a lot folks.

Please remember the goal is not to compress the floating-point numbers per se. It's actually to measure the entropy of the data. I don't really care how much compression it can maximally achieve.

Using WinZip is a great idea; however, I am looking for

(1) a command inside Matlab;
(2) a bare-bones compression, without the header info, etc. of WinZip, because those are overhead in terms of measuring entropy...

Any more thoughts?

Thank you!
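One way to get the "bare-bones" number asked for here is a raw DEFLATE stream, which omits the container header and checksum entirely. This is a sketch in Python's zlib purely for illustration (MATLAB users often reach the same codec through Java's java.util.zip classes); the negative wbits value is the standard zlib switch for headerless output:

```python
import struct
import zlib

def raw_deflate_size(floats):
    """Compress the raw bytes of a float64 vector with a headerless
    (raw) DEFLATE stream and return the compressed length in bytes."""
    raw = struct.pack(f"<{len(floats)}d", *floats)
    comp = zlib.compressobj(level=9, wbits=-15)  # wbits < 0 -> no zlib header/trailer
    out = comp.compress(raw) + comp.flush()
    return len(out)

data = [1.0] * 1000                    # highly redundant input
print(raw_deflate_size(data))          # far fewer than the 8000 raw bytes
```

The compressed length divided by the raw length (8 bytes per double here) gives a header-free upper bound on bits per sample.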
On Apr 4, 11:27 am, Luna Moon <lunamoonm...@gmail.com> wrote:
> On Apr 3, 8:29 pm, "Mark Shore" <msh...@magmageosciences.ca> wrote:
(snip)
> Thanks a lot folks.
>
> Please remember the goal is not to compress the floating numbers per
> se. It's actually to measure the entropy of the data.
>
> I don't really care how much compression it can maximally achieve.
>
> Using WinZip is a great idea, however, I am looking for
>
> (1) a command inside Matlab;
> (2) a bare-bone compression, without the header info, etc. in Winzip,
> because those are overheads in terms of measuring entropy...
>
> Any more thoughts?
the entropy is the mean number of bits of information (bits as in information theory) of the messages. certainly the least significant bits of the words will be nearly completely random.

let's say each word is N bits (N is likely 32 or 64) and you decide that the bottom M bits are completely random crap (they might not be, if you have a lot of simple fractions like 1/2 and 1/4 etc). so make a histogram: ignoring the least significant M bits (you'll have to decide what M is), count the number of occurrences of every sample value. for a specific value V, the information content of the top N-M bits is -log2(p(V)), where p(V) is the probability of any sample taking on value V. you get that by "frequency of occurrence": divide the number of occurrences of V by the total number of samples. the entropy is the mean information content, plus the M random bits:

   H  =  SUM { p(V) * (-log2(p(V))) }  +  M
          V

you can try setting M to zero, but you'll have a helluva lot of bins in your histogram. in fact, what i would do is compute this entropy for a variety of different values of M (that your memory allows for) and see if the entropy changes much. if you have a lot of simple fractions in your mantissa, then i was wrong in assuming you can assume the lower M bits are fully randomly scrambled. but you can't practically have 2^32 bins for your histogram, so something must be done. maybe split the upper and lower portions of the floating-point word into two messages (that we hope are independent) and run a histogram on both.

i have never done bit-masking in MATLAB. you'll have to figger out how to do that.

r b-j
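The histogram scheme above can be sketched in a few lines; this is Python rather than MATLAB, purely for illustration (MATLAB's typecast/bitshift would do the same bit-masking), and the function name is made up here:

```python
import math
import struct
from collections import Counter

def entropy_bits_per_sample(samples, m):
    """Estimate entropy (bits/sample) of single-precision floats,
    treating the low m bits of each 32-bit word as fully random,
    per the masked-histogram scheme above."""
    counts = Counter()
    for x in samples:
        (w,) = struct.unpack("<I", struct.pack("<f", x))  # raw 32-bit word
        counts[w >> m] += 1                               # drop low m bits
    n = len(samples)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h + m  # each masked bit is assumed to carry one full bit

data = [0.5, 0.5, 0.25, 0.25]            # two equiprobable values
print(entropy_bits_per_sample(data, 0))  # 1.0
```

Sweeping m and watching where the estimate stops changing, as suggested above, is then a one-line loop.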
On 4/4/2010 12:52 AM, robert bristow-johnson wrote:
> On Apr 3, 7:13 pm, John <sampson...@gmail.com> wrote:
>
>> Nobody in here has a sense of humor
>
> i got it. your file compression scheme sorta rearranged the order of
> data.
>
> but yer right. no sense of humor to be found here.
Didn't it strike you as odd that any file of floats, however long, could be _reversibly_ compressed to two integers?

Jerry
--
"It does me no injury for my neighbor to say there are 20 gods, or no
God. It neither picks my pocket nor breaks my leg."
  Thomas Jefferson to the Virginia House of Delegates in 1776.
On 4/4/2010 1:51 AM, Vladimir Vassilevsky wrote:
>
> robert bristow-johnson wrote:
>
>> On Apr 3, 7:13 pm, John <sampson...@gmail.com> wrote:
>>
>>> Nobody in here has a sense of humor
>>
>> i got it. your file compression scheme sorta rearranged the order of
>> data.
>>
>> but yer right. no sense of humor to be found here.
>
> JFYI:
>
> http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
I didn't spend enough time on it to see how the input is recovered.

Jerry
In comp.dsp Luna Moon <lunamoonmoon@gmail.com> wrote:
(snip)
 
> Please remember the goal is not to compress the floating numbers per
> se. It's actually to measure the entropy of the data.
> I don't really care how much compression it can maximally achieve.
(snip)

If you can find the (low) entropy then you can compress the data. The hard part, usually, is finding it. For an array of floating-point numbers it seems most likely that you would find it in terms of repetitions, that is, other places in the file with exactly the same value. Other than that, it will be hard to find unless you know the source.

Say, for example, you have a file of sin(n) (in radians) for integer n from zero to some large number. Now, that has fairly low entropy under the assumption that you have a good sin() routine available, but it will be difficult for a program that doesn't know the file is likely to contain sin(n) to find that out. If someone tries a Fourier transform on the data then they might discover the pattern. As the result might not be exact, one would code an approximation and then list the (much smaller) differences between the two data sets.

Continuing, the output of a linear-congruential random number generator is also easy to predict if you know the constants of the generator. If you don't, and you have a big enough sample, then you can likely find the pattern. (If you have the bits exactly, that is; I am not sure how long it would take.) If you have, say, sin() of the linear-congruential number stream then it is likely much more difficult.

-- glen
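The sin(n) example is easy to try: the data has a tiny description given the model, but a byte-oriented compressor sees the mantissa bits as noise. A quick Python sketch (illustrative only; the exact ratio will vary with the compressor):

```python
import math
import struct
import zlib

# A float64 vector of sin(n): low entropy *given the model*, but the
# mantissa bytes look like noise to a general-purpose compressor.
values = [math.sin(n) for n in range(10000)]
raw = struct.pack("<10000d", *values)
comp = zlib.compress(raw, 9)

# The generic compressor barely shrinks it (mostly by squeezing the
# predictable sign/exponent bytes), while "the integers 0..9999 plus
# a sin() routine" describes the same data in a handful of bytes.
print(len(comp) / len(raw))
```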
Jerry Avins <jya@ieee.org> wrote:
(snip)

>> http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
> I didn't spend enough time on it to see how the input is recovered.
The page has a pretty easy-to-follow explanation of how the input can be recovered. It isn't so obvious from that, though, that it can be done fast.

The quick explanation (but not necessarily implementation) of the forward transform is to take all cyclic permutations of the input string, sort the resulting list, and output the last character of each.

It is easy to see that to get the first character of each entry in such a sorted list, you only need to sort the list of last characters. If you keep track of the last character while sorting, you now have character pairs. Repeat the process of prepending the characters and sorting until you have the complete sorted list of cyclic permutations. The only additional thing you need is to know where in the list the original input string sits. In the wiki article they use a special end-of-string character for that, but I believe in the actual transform they just remember its position.

That shows that the transform can be reversed, not necessarily the fastest way to do it.

-- glen
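The description above translates almost line for line into code. A small Python sketch of the naive BWT, with the inverse done exactly by the repeated prepend-and-sort argument (real implementations use suffix arrays instead, for speed):

```python
def bwt(s):
    """Forward Burrows-Wheeler transform: sort all cyclic rotations
    of s, output the last column plus the row index of s itself."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    last_column = "".join(r[-1] for r in rotations)
    return last_column, rotations.index(s)

def ibwt(last_column, idx):
    """Inverse transform, done exactly as described above: repeatedly
    prepend the last column to the table and re-sort, n times."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[idx]

transformed, idx = bwt("banana")
print(transformed, idx)        # -> nnbaaa 3 (equal characters cluster)
print(ibwt(transformed, idx))  # -> banana
```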
On Apr 4, 1:58 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
...
> Continuing, the output of a linear-congruential random number
> generator is also easy to predict if you know the constants of
> the generator.
yeah, i guess you need a couple of constants and the initial seed value. but don't you also need to somehow encode the rng algorithm, too?
> If you don't, and you have a big enough sample,
> then you can likely find the pattern. (If you have the bits
> exactly, though I am not sure how long it would take.)
>
> If you have, say, sin() of the linear-congruential number
> stream then it is likely much more difficult.
it will look different in a histogram. suppose the rng output was scaled to be uniformly distributed over a segment whose length is a multiple of 2*pi; then the p.d.f. of its sine would go up as it approaches +1 or -1.

r b-j
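This histogram claim is easy to check numerically: the sine of a uniform variable over a whole number of periods has the arcsine density 1/(pi*sqrt(1-v^2)), which piles up near +/-1. A quick Python sketch (not tied to any particular rng):

```python
import math
import random

random.seed(0)
# Uniform over one full period [0, 2*pi); its sine follows the
# arcsine density, which blows up as v -> +/-1.
samples = [math.sin(random.uniform(0.0, 2.0 * math.pi))
           for _ in range(100000)]

near_edges = sum(1 for v in samples if abs(v) > 0.9)  # expect ~28.7%
near_zero = sum(1 for v in samples if abs(v) < 0.1)   # expect ~6.4%
print(near_edges, near_zero)
```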
On 4/4/2010 2:42 PM, glen herrmannsfeldt wrote:

   ...

> That shows that the transform can be reversed, not necessarily > the fastest way to do it.
Great! Too clever by half, though. No wonder it's fairly new.

Jerry
In comp.dsp robert bristow-johnson <rbj@audioimagination.com> wrote:
> On Apr 4, 1:58?pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>> Continuing, the output of a linear-congruential random number
>> generator is also easy to predict if you know the constants of
>> the generator.
> yeah, i guess you need a couple of constants and the initial seed > value. but don't you also need to somehow encode the rng algorithm, > too?
Well, linear congruential pretty much means multiply by a constant, add a constant (possibly zero), and take the result modulo a constant. I am not actually sure how long it takes, given a sufficiently long sample of the output, to find the constants.
>> If you don't, and you have a big enough sample,
>> then you can likely find the pattern. (If you have the bits
>> exactly, though I am not sure how long it would take.)
>> If you have, say, sin() of the linear-congruential number
>> stream then it is likely much more difficult.
> it will look different in a histogram. suppose the rng was scaled to
> be uniformly distributed over a segment as long as any multiple of
> 2pi, then the p.d.f. would go up as it approaches +1 or -1.
Yes, you could do that. But even assuming you have the ability to find the constants of an LCG from its output, it is much harder if you don't have all the bits of the generator output. If, for example, you have the single-precision sine then you likely don't have enough bits left after taking the arcsine.

-- glen
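For concreteness, a minimal generator of the kind glen describes; the constants below are the well-known Park-Miller "minimal standard" values (a = 16807, c = 0, m = 2^31 - 1), chosen here only for illustration, in Python rather than MATLAB:

```python
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    """Linear congruential generator: multiply by a constant, add a
    constant (possibly zero), take the result modulo a constant."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(1)
# The whole stream is determined by (a, c, m) and the seed, which is
# why knowing the constants makes the output trivially predictable.
print([next(gen) for _ in range(3)])
```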
On Apr 5, 12:39 am, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> In comp.dsp robert bristow-johnson <r...@audioimagination.com> wrote:
>
> > On Apr 4, 1:58 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> >> Continuing, the output of a linear-congruential random number
> >> generator is also easy to predict if you know the constants of
> >> the generator.
> > yeah, i guess you need a couple of constants and the initial seed
> > value. but don't you also need to somehow encode the rng algorithm,
> > too?
>
> Well, linear congruential pretty much means multiply by
> a constant, add a constant (possibly zero) and modulo a constant.
right. maybe it's a dumb point, but there are other pseudo-r.n.g. algs (that don't necessarily produce good r.n.), and there are zillions of different permutations. it just seems to me that a complete encoding might include information for how the r.n.g. alg works, besides any seed numbers. (sorta like a code book, what Luna doesn't want to see in a header, but what i think sorta belongs.)
> I am not actually sure how long it takes, given a sufficiently
> long sample of the output, to find the constants.
i might think it would be a bitch. especially with a weird modulo.
> >> If you don't, and you have a big enough sample,
> >> then you can likely find the pattern. (If you have the bits
> >> exactly, though I am not sure how long it would take.)
> >> If you have, say, sin() of the linear-congruential number
> >> stream then it is likely much more difficult.
> > it will look different in a histogram. suppose the rng was scaled to
> > be uniformly distributed over a segment as long as any multiple of
> > 2pi, then the p.d.f. would go up as it approaches +1 or -1.
>
> Yes you could do that. But assuming that you have the ability
> to find the constants for an LCG from the output,
are you assuming that?
> it is much
> harder if you don't have all the bits of the generator output.
i just think it would be very hard, in nearly any case.
> If, for example, you have the single precision sine then you
> likely don't have enough bits after taking the arcsine.
with round-off, is this sine mapping one-to-one? if not, then the arcsine won't be able to undo it losslessly.

r b-j
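The question is easy to probe: in single precision, sin() is not one-to-one near its peak, so arcsin cannot losslessly undo it. A stdlib Python sketch, using struct round-tripping to emulate float32 rounding:

```python
import math
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Two distinct adjacent float32 inputs near pi/2 (1.2e-7 is about one
# float32 ulp at this magnitude)...
x1 = f32(1.57)
x2 = f32(x1 + 1.2e-7)
assert x1 != x2

# ...whose single-precision sines collide: sin is flat near its peak,
# so the difference (~1e-10) vanishes below the float32 ulp (~6e-8).
s1 = f32(math.sin(x1))
s2 = f32(math.sin(x2))
print(s1 == s2)  # True: arcsin cannot tell which input produced it
```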