DSPRelated.com
Forums

Building Peak Files

Started by seij...@gmail.com March 18, 2005
I'm a complete newbie to audio processing but I'm working on a project
that requires some audio imaging.  The audio files will typically be 30
minute files that average about 256 MB.  I can graph the whole thing
just fine but it's slow and uses too much memory.

So, I'm hoping to learn how to use peak files to speed up display
times.  I'm not looking for very detailed information (although that
would be wonderful), I'm looking for some pointers on what I need to be
watching for and doing.  I know the general idea of just graphing the
max value of a block of samples.  My main problem is what to do when
the user zooms in or out of the audio file.

Can anyone give me some pointers, links or book suggestions?  I'd
really appreciate the help.

in article 1111120728.729655.39800@o13g2000cwo.googlegroups.com,
seijin@gmail.com at seijin@gmail.com wrote on 03/17/2005 23:38:

> So, I'm hoping to learn how to use peak files to speed up display
> times.  I'm not looking for very detailed information (although that
> would be wonderful), I'm looking for some pointers on what I need to be
> watching for and doing.  I know the general idea of just graphing the
> max value of a block of samples.  My main problem is what to do when
> the user zooms in or out of the audio file.
what do you mean by "peak files"?  there's some product line by a company
called "Peak Audio" and you don't mean anything related to them, do you?
(it all looks hardware anyway.)

if you mean having a separate file that has the data needed to display the
audio (in a compressed time-scale), the concept is pretty simple.  you want
to pick a nominal downsampling (zoom) ratio (your display might downsample
more than that).  let's say it's 128 to 1, just for shits&grins.  that would
reduce your 256 meg to 4 meg for the "peak file".  that means, for the first
analysis of the audio file, you find the maximum value (the most positive or
least negative) and the minimum (most negative or least positive) value for
each segment of 128 samples, and you store those 2 values for each segment
of 128 samples in this "peak file".

if you're zooming in more than that (where 1 pixel is good for something
less than 128 samples), you need only to reanalyze that one segment of the
audio, but it's not such a large segment.  if you're zooming _out_ from that
ratio of 128 to 1, then limit your zoom-out ratio to be a multiple of 128
and apply the max and min operations to your "peak file" that was done at a
ratio of 128.  then you don't have to reanalyze the whole damn 30 minute
audio file.

make sense?

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
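The first-pass analysis described above can be sketched in a few lines of Python (an illustrative sketch, not any editor's actual code; the function name, the `BLOCK` constant, and the in-memory list are my assumptions):

```python
# Sketch of the first analysis pass: for each segment of 128 samples,
# store the (min, max) pair in the "peak file".

BLOCK = 128  # nominal downsampling (zoom) ratio: samples per peak entry

def build_peaks(samples, block=BLOCK):
    """Return one (min, max) pair per block of `block` samples."""
    peaks = []
    for start in range(0, len(samples), block):
        seg = samples[start:start + block]
        peaks.append((min(seg), max(seg)))
    return peaks
```

With two values per 128-sample segment, the peak data is about 1/64 the size of the audio, which is where the 256 MB to 4 MB reduction above comes from.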
Actually, it does make sense.  But if zooming in, I would have to
reanalyze the whole file, wouldn't I?  Because the user would want the
ability to scroll through the whole file at a fine detail.  So maybe
they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't
I need to reanalyze the whole file by grabbing 50 samples, finding the
min & max and then plotting?

I can see that zooming out wouldn't be a problem since you're just
pretending that 128 samples is really 1 sample.  So they zoom out once
and now 256 samples is equal to 1 sample.  And instead of re-reading
the whole file to get the min and max of 256 samples you'd just get the
minimum and maximum of the first two "blocks" of 128 samples, right?
And then it should be fine zooming back in as long as they only zoom
into a detail of 128 samples/pixel as that should be loaded into memory
at that detail.

Am I on the same level?

in article 1111450446.938466.45180@f14g2000cwb.googlegroups.com,
seijin@gmail.com at seijin@gmail.com wrote on 03/21/2005 19:14:

> Actually, it does make sense.
...
> I can see that zooming out wouldn't be a problem since you're just
> pretending that 128 samples is really 1 sample.  So they zoom out once
> and now 256 samples is equal to 1 sample.  And instead of re-reading
> the whole file to get the min and max of 256 samples you'd just get the
> minimum and maximum of the first two "blocks" of 128 samples, right?
exactly. and as long as your wider zoom ratio is a multiple of 128 samples per pixel, you need not look at the audio file at all. just get your min and max from the "peak file".
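Merging adjacent peak-file entries, as just described, might look like this (illustrative sketch; `peaks` is a list of (min, max) pairs as built at 128 samples per entry):

```python
# Derive a coarser view from the peak file alone: merge `factor` adjacent
# (min, max) entries into one, so a peak file built at 128 samples/entry
# serves any zoom that is a multiple of 128 samples/pixel.

def coarser_peaks(peaks, factor):
    out = []
    for start in range(0, len(peaks), factor):
        group = peaks[start:start + factor]
        out.append((min(lo for lo, _ in group),
                    max(hi for _, hi in group)))
    return out
```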
> And then it should be fine zooming back in as long as they only zoom
> into a detail of 128 samples/pixel as that should be loaded into memory
> at that detail.
yeah, i guess 4 meg (for 30 minutes of audio) isn't too bad to load into memory.
> Am I on the same level?
i think so. ...
> But if zooming in, I would have to reanalyze the whole file, wouldn't I?
no, i don't think so.
> Because the user would want the
> ability to scroll through the whole file at a fine detail.  So maybe
> they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't
> I need to reanalyze the whole file by grabbing 50 samples, finding the
> min & max and then plotting?
but i don't see why you think you need to do that for the *whole* audio
file.  as the user presses the scroll left or scroll right arrows, the
display is moved some amount to the right or left (respectively), with some
of it "falling off the edge", and there is this hole in the display you have
to fill in.  only the audio for that hole needs to be analyzed for a min and
max per pixel.  not the whole audio file.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
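Filling only that hole could be sketched like so (illustrative; `samples` is the raw audio, the hole covers samples [hole_start, hole_end), and `spp` is samples per pixel at the current zoomed-in view):

```python
# When the view scrolls, only the newly exposed region needs a fresh scan
# of the raw audio: one (min, max) per new pixel column.

def fill_hole(samples, hole_start, hole_end, spp):
    cols = []
    for s in range(hole_start, hole_end, spp):
        seg = samples[s:min(s + spp, hole_end)]
        cols.append((min(seg), max(seg)))
    return cols  # one (min, max) per newly exposed pixel column
```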
CoolEdit uses this general technique, and they call them peak files (.pk
extension), so that is probably where the terminology came from (not Peak Audio
who make CobraNet).

It sounds like you guys already have this pretty well nailed down, but to
summarize, the basic idea is that the peak file is a down-sampled version of the
original audio, but the downsampling process takes the maximum absolute value of
each block of, say, 128 samples, rather than a traditional
filtering/decimation process.  The peak file is used to draw the waveform when
you are zoomed out, and when you are zoomed in, you only need a small portion of
the audio file, so you use the actual audio data.

The peak file is created either during recording, or if you are opening an
existing file, when the file is opened.  CoolEdit will save the peak files along
with the original .wav file so that on subsequent opens, it need not be
recalculated.  They must somehow deal with the problem of keeping the peak file
in sync with the wave file when you copy/paste, change volume, etc.  A brute
force method would be to simply rescan the whole file after every change and
recreate the peak file anew--simple, fool-proof, but very inefficient,
especially with large files.  Smarter algorithms that only update the affected
portion are probably used, and the rescanning is probably built into the
processing routines so the audio data only needs to be accessed once.

One other subtlety to mention, there are 2 different ways I've seen of dealing
with the bi-polar nature of audio.  You could store a single maximum absolute
value for each block, and then display the waveform as either a simple
positive-only envelope or as a symmetrical bi-polar waveform with the absolute
maximum value used for both the positive and negative value.  A second way would
be to store the maximum and minimum (most negative) values for each block, and
then draw the bi-polar waveform from that.  The first way cuts the storage
requirements in half, while the second way is a bit more accurate.
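The two storage schemes side by side, as a sketch (illustrative Python; function names and the normalization are my assumptions):

```python
# Scheme 1: one value per block -- the largest absolute sample.
# Half the storage; the display mirrors it above and below zero.
def absmax_peaks(samples, block=128):
    return [max(abs(s) for s in samples[i:i + block])
            for i in range(0, len(samples), block)]

# Scheme 2: (min, max) per block.  Twice the storage, but an
# asymmetric waveform is drawn faithfully.
def minmax_peaks(samples, block=128):
    return [(min(samples[i:i + block]), max(samples[i:i + block]))
            for i in range(0, len(samples), block)]
```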

Also, using the 128-sample block example, you don't necessarily need to restrict
zooming out to be in 128-sample intervals.  The peak file could be interpolated
as necessary to create arbitrary views.  I think simple "nearest-neighbor"
(zero-order) interpolation would be sufficient for this crude display-only
waveform.

in article 3ab7k9F68uqeeU1@individual.net, Jon Harris at
goldentully@hotmail.com wrote on 03/22/2005 13:49:

> One other subtlety to mention, there are 2 different ways I've seen of dealing
> with the bi-polar nature of audio.  You could store a single maximum absolute
> value for each block, and then display the waveform as either a simple
> positive-only envelope or as a symmetrical bi-polar waveform with the absolute
> maximum value used for both the positive and negative value.  A second way
> would be to store the maximum and minimum (most negative) values for each
> block, and then draw the bi-polar waveform from that.  The first way cuts the
> storage requirements in half, while the second way is a bit more accurate.
i like the second way. for each pixel, you draw a vertical line from the max value to the min value (for the entire block corresponding to that pixel). it works all the way down to 1 sample/pixel. very consistent in behavior.
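That per-pixel vertical line could be placed on screen roughly like this (a sketch under my own assumptions: samples normalized to [-1.0, +1.0], row 0 at the top of a display `height` pixels tall):

```python
# Map one (min, max) peak pair to the (top_row, bottom_row) of the
# vertical line drawn in that pixel column.

def column_span(lo, hi, height):
    def row(v):  # +1.0 -> row 0 (top), -1.0 -> row height-1 (bottom)
        return int(round((1.0 - v) / 2.0 * (height - 1)))
    return (row(hi), row(lo))
```

At 1 sample/pixel, min and max coincide and the line degenerates to a single point, which is why the behavior is consistent all the way down.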
> Also, using the 128-sample block example, you don't necessarily need to
> restrict zooming out to be in 128-sample intervals.  The peak file could be
> interpolated as necessary to create arbitrary views.
i think that's bad.  each pixel maps to a contiguous segment of samples and
you want the max value and the min value for that segment.  the peak file
will not have the information of exactly where the max and min were, nor
whether there was another peak that almost hit the max (but lost the contest
with the real max).  that other peak (which ain't in the peak file) might
become the true peak for a remapped segment of audio if you choose the zoom
ratio to be any arbitrary view, unless the new view is a multiple of
128/pixel (for which you can figger out the max and min from the peak file).
> I think simple "nearest-neighbor" (zero-order) interpolation would be
> sufficient for this crude display-only waveform.
i don't think it would look so good.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
On 21 Mar 2005 16:14:07 -0800, "seijin@gmail.com" <seijin@gmail.com>
wrote:

>Actually, it does make sense.  But if zooming in, I would have to
>reanalyze the whole file, wouldn't I?  Because the user would want the
>ability to scroll through the whole file at a fine detail.  So maybe
>they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't
>I need to reanalyze the whole file by grabbing 50 samples, finding the
>min & max and then plotting?
>
>I can see that zooming out wouldn't be a problem since you're just
>pretending that 128 samples is really 1 sample.  So they zoom out once
>and now 256 samples is equal to 1 sample.  And instead of re-reading
>the whole file to get the min and max of 256 samples you'd just get the
>minimum and maximum of the first two "blocks" of 128 samples, right?
>And then it should be fine zooming back in as long as they only zoom
>into a detail of 128 samples/pixel as that should be loaded into memory
>at that detail.
>
>Am I on the same level?
Yes, this appears to be how most audio editors work, generating some sort of
file that's much smaller than the original, but that effectively has the
envelope, and this peak file is used for fast displays of larger portions of
the file.  Cool Edit and N-Track Studio (and probably Goldwave, but I don't
remember) do all their scanning and generation of the peak file when a file
is opened or as it's recorded, and leave the file with the name of the .wav
file but with something like a .pk extension.

Also, cdwave(.com) does this scanning, but apparently only uses one file for
peak data.  If you reload the last file it's much faster, but if you load a
different file and then the first file, it takes the 'regular' slow time to
scan.

If I were writing something like that it would be tempting to decode one of
the file formats and use it, but there might be some legal problems with
doing that.  This sort of thing is probably well-documented somewhere,
perhaps harmony-central.com.  Ask the guy at cdwave.com how he does it (even
though there are only two display sizes).  I recall a poster here (comp.dsp)
years ago saying he was writing a unix/linux audio editor, maybe you could
hunt him down and ask him.

-----
http://mindspring.com/~benbradley
On Tue, 22 Mar 2005 10:49:44 -0800, "Jon Harris"
<goldentully@hotmail.com> wrote:



>One other subtlety to mention, there are 2 different ways I've seen of dealing
>with the bi-polar nature of audio.  You could store a single maximum absolute
>value for each block, and then display the waveform as either a simple
>positive-only envelope or as a symmetrical bi-polar waveform with the absolute
>maximum value used for both the positive and negative value.  A second way
>would be to store the maximum and minimum (most negative) values for each
>block, and then draw the bi-polar waveform from that.  The first way cuts the
>storage requirements in half, while the second way is a bit more accurate.
I think it's important to do the second way, as many sounds (especially the
most common things recorded, voice and musical instruments) are
asymmetrical, and if you do only one half of the waveform, it could be
heavily clipped on the other half and you wouldn't know it.
>Also, using the 128-sample block example, you don't necessarily need to
>restrict zooming out to be in 128-sample intervals.  The peak file could be
>interpolated as necessary to create arbitrary views.  I think simple
>"nearest-neighbor" (zero-order) interpolation would be sufficient for this
>crude display-only waveform.
I agree and I'd think this is how they do it.

-----
http://mindspring.com/~benbradley
"Ben Bradley" <ben_nospam_bradley@frontiernet.net> wrote in message
news:t8r041dr1tcpk8stsk3con5shr3h7mktbh@4ax.com...
> On Tue, 22 Mar 2005 10:49:44 -0800, "Jon Harris"
> <goldentully@hotmail.com> wrote:
>
> >One other subtlety to mention, there are 2 different ways I've seen of
> >dealing with the bi-polar nature of audio.  You could store a single
> >maximum absolute value for each block, and then display the waveform as
> >either a simple positive-only envelope or as a symmetrical bi-polar
> >waveform with the absolute maximum value used for both the positive and
> >negative value.  A second way would be to store the maximum and minimum
> >(most negative) values for each block, and then draw the bi-polar
> >waveform from that.  The first way cuts the storage requirements in
> >half, while the second way is a bit more accurate.
>
> I think it's important to do the second way, as many sounds
> (especially the most common things recorded, voice and musical
> instruments) are asymmetrical, and if you do only one half of the
> waveform, it could be heavily clipped on the other half and you
> wouldn't know it.
To do it right, you would store the greater of the positive and negative peaks (i.e. the max of the absolute value). Then you would never have clipping that doesn't show up. But I agree the second method is superior since it provides more information than the first.
"robert bristow-johnson" <rbj@audioimagination.com> wrote in message
news:BE65D67E.57D3%rbj@audioimagination.com...
> in article 3ab7k9F68uqeeU1@individual.net, Jon Harris at
> goldentully@hotmail.com wrote on 03/22/2005 13:49:
>
> > Also, using the 128-sample block example, you don't necessarily need to
> > restrict zooming out to be in 128-sample intervals.  The peak file could
> > be interpolated as necessary to create arbitrary views.
>
> i think that's bad.  each pixel maps to a contiguous segment of samples and
> you want the max value and the min value for that segment.  the peak file
> will not have the information of exactly where the max and min were nor if
> there was another peak that almost hit the max (but lost the contest with
> the real max).  that other peak (which ain't in the peak file) might become
> the true peak for a remapped segment of audio if you choose the zoom ratio
> to any arbitrary view unless the new view is a multiple of 128/pixel (of
> which you can figger out the max and min from the peak file).
>
> > I think simple "nearest-neighbor" (zero-order) interpolation would be
> > sufficient for this crude display-only waveform.
>
> i don't think it would look so good.
Well, I know that CoolEdit and others do allow arbitrary zoom settings, and
the resulting displays look just fine.  But I don't know exactly how they do
it.  To get around some of the problems, when interpolating between peak
values, perhaps the largest one should be used.  That will at least make it
so you never show a peak value smaller than it actually is.

And keep in mind these types of "massively zoomed out" displays are usually
just rough pictures of the envelope to help you identify major features.
You wouldn't typically be trying to select something as fine as 1 screen
pixel when zoomed out like that.

I just did a quick experiment with some really low-frequency sine waves
(1-5 Hz) in CoolEdit, and it does look a bit "chunky" when the peak file is
being used.  You can see when it switches to using the real audio data as
the waveform then becomes pristine.
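One way to realize the "use the largest one" idea without ever understating a peak (a sketch of my own assumption about the approach, not how CoolEdit actually does it): for a pixel covering samples [a, b) at an arbitrary zoom, take the min/max over every 128-sample peak block the pixel touches.

```python
# Conservative arbitrary-zoom lookup: the edges of a pixel may borrow a
# little amplitude from neighboring pixels' blocks, but a peak is never
# shown smaller than it really is.  `peaks` is a list of (min, max)
# pairs, one per `block` samples.

def pixel_minmax(peaks, a, b, block=128):
    first = a // block           # first peak block the pixel touches
    last = (b - 1) // block      # last peak block the pixel touches
    group = peaks[first:last + 1]
    return (min(lo for lo, _ in group), max(hi for _, hi in group))
```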