Building Peak Files
Started by ●March 18, 2005

I'm a complete newbie to audio processing, but I'm working on a project that requires some audio imaging. The audio files will typically be 30-minute files that average about 256 MB. I can graph the whole thing just fine, but it's slow and uses too much memory. So I'm hoping to learn how to use peak files to speed up display times. I'm not looking for very detailed information (although that would be wonderful); I'm looking for some pointers on what I need to be watching for and doing. I know the general idea of just graphing the max value of a block of samples. My main problem is what to do when the user zooms in or out of the audio file. Can anyone give me some pointers, links or book suggestions? I'd really appreciate the help.

Reply by ●March 18, 2005
in article 1111120728.729655.39800@o13g2000cwo.googlegroups.com, seijin@gmail.com at seijin@gmail.com wrote on 03/17/2005 23:38:

> So, I'm hoping to learn how to use peak files to speed up display times. I'm not looking for very detailed information (although that would be wonderful), I'm looking for some pointers on what I need to be watching for and doing. I know the general idea of just graphing the max value of a block of samples. My main problem is what to do when the user zooms in or out of the audio file.

what do you mean by "peak files"? there's some product line by a company called "Peak Audio" and you don't mean anything related to them, do you? (it all looks like hardware anyway.)

if you mean having a separate file that has the data needed to display the audio (in a compressed time-scale), the concept is pretty simple. you want to pick a nominal downsampling (zoom) ratio (your display might downsample more than that). let's say it's 128 to 1, just for shits&grins. that would reduce your 256 meg to 4 meg for the "peak file". that means, for the first analysis of the audio file, you find the maximum value (the most positive or least negative) and the minimum (most negative or least positive) value for each segment of 128 samples, and you store those 2 values for each segment of 128 samples in this "peak file".

if you're zooming in more than that (where 1 pixel is good for something less than 128 samples) you need only reanalyze that one segment of the audio, but it's not such a large segment. if you're zooming _out_ from that ratio of 128 to 1, then limit your zoom-out ratio to be a multiple of 128 and apply the max and min operations to your "peak file" that was done at a ratio of 128. then you don't have to reanalyze the whole damn 30-minute audio file.

make sense?

--

r b-j  rbj@audioimagination.com

"Imagination is more important than knowledge."
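[Editor's note: to make the block-scan r b-j describes concrete, here is a minimal sketch. It is not from the thread; the function name, the NumPy dependency, and the choice of an (n_blocks, 2) layout are my own assumptions.]

```python
import numpy as np

BLOCK = 128  # nominal downsampling ratio, as in the 128-to-1 example above

def build_peak_data(samples: np.ndarray) -> np.ndarray:
    """Return an (n_blocks, 2) array holding (min, max) per 128-sample block.

    This is the "first analysis" pass: each contiguous segment of 128
    samples contributes its most negative and most positive value.
    """
    n_blocks = len(samples) // BLOCK
    blocks = samples[: n_blocks * BLOCK].reshape(n_blocks, BLOCK)
    return np.stack([blocks.min(axis=1), blocks.max(axis=1)], axis=1)
```

At 16-bit mono, 44.1 kHz, a 30-minute file has about 79 million samples, so the peak data is roughly 620k (min, max) pairs -- the "4 meg for the peak file" figure above.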
Reply by ●March 21, 2005
Actually, it does make sense. But if zooming in, I would have to reanalyze the whole file, wouldn't I? Because the user would want the ability to scroll through the whole file at a fine detail. So maybe they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't I need to reanalyze the whole file by grabbing 50 samples, finding the min & max and then plotting?

I can see that zooming out wouldn't be a problem since you're just pretending that 128 samples is really 1 sample. So they zoom out once and now 256 samples is equal to 1 sample. And instead of re-reading the whole file to get the min and max of 256 samples you'd just get the minimum and maximum of the first two "blocks" of 128 samples, right?

And then it should be fine zooming back in as long as they only zoom into a detail of 128 samples/pixel as that should be loaded into memory at that detail. Am I on the same level?
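[Editor's note: the zoom-out case described here -- merging adjacent blocks from the peak data rather than re-reading the audio -- can be sketched as follows. The function name and NumPy layout are illustrative; `peaks` is assumed to be an (n_blocks, 2) array of (min, max) per 128-sample block.]

```python
import numpy as np

def merge_blocks(peaks: np.ndarray, factor: int) -> np.ndarray:
    """Combine `factor` adjacent (min, max) peak blocks into one pair.

    Going from 128 samples/pixel to 256 samples/pixel is factor=2:
    the merged min is the min of the two block mins, the merged max
    the max of the two block maxes. No audio data is touched.
    """
    n = len(peaks) // factor
    grouped = peaks[: n * factor].reshape(n, factor, 2)
    return np.stack([grouped[:, :, 0].min(axis=1),
                     grouped[:, :, 1].max(axis=1)], axis=1)
```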
Reply by ●March 21, 2005
in article 1111450446.938466.45180@f14g2000cwb.googlegroups.com, seijin@gmail.com at seijin@gmail.com wrote on 03/21/2005 19:14:

> Actually, it does make sense.

...

> I can see that zooming out wouldn't be a problem since you're just pretending that 128 samples is really 1 sample. So they zoom out once and now 256 samples is equal to 1 sample. And instead of re-reading the whole file to get the min and max of 256 samples you'd just get the minimum and maximum of the first two "blocks" of 128 samples, right?

exactly. and as long as your wider zoom ratio is a multiple of 128 samples per pixel, you need not look at the audio file at all. just get your min and max from the "peak file".

> And then it should be fine zooming back in as long as they only zoom into a detail of 128 samples/pixel as that should be loaded into memory at that detail.

yeah, i guess 4 meg (for 30 minutes of audio) isn't too bad to load into memory.

> Am I on the same level?

i think so.

...

> But if zooming in, I would have to reanalyze the whole file, wouldn't I?

no, i don't think so.

> Because the user would want the ability to scroll through the whole file at a fine detail. So maybe they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't I need to reanalyze the whole file by grabbing 50 samples, finding the min & max and then plotting?

but i don't see why you think you need to do that for the *whole* audio file. as the user presses the scroll-left or scroll-right arrows, the display is moved some amount to the right or left (respectively) with some of it "falling off the edge", and there is this hole in the display you have to fill in. only the audio for that hole needs to be analyzed for a min and max per pixel. not the whole audio file.

--

r b-j  rbj@audioimagination.com

"Imagination is more important than knowledge."
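[Editor's note: the "fill in the hole" idea -- only re-scanning the samples exposed by a scroll, not the whole file -- might look like the sketch below. All names and the argument layout are my own; only a rightward scroll is shown.]

```python
import numpy as np

def fill_scroll_hole(samples, view_start, width_px, spp, shift_px):
    """Return (min, max) pairs for just the pixels exposed by a rightward scroll.

    samples    -- the full audio array (only the hole is actually read)
    view_start -- first sample of the *new* view
    width_px   -- view width in pixels
    spp        -- samples per pixel at the current (fine) zoom
    shift_px   -- how many pixels the view scrolled right
    """
    first_new_px = width_px - shift_px  # everything left of this just shifted
    cols = []
    for px in range(first_new_px, width_px):
        seg = samples[view_start + px * spp : view_start + (px + 1) * spp]
        cols.append((seg.min(), seg.max()))
    return cols
```

For a 2-pixel scroll on a 1000-pixel display, only 2 * spp samples get read, regardless of the file length.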
Reply by ●March 22, 2005
CoolEdit uses this general technique, and they call them peak files (.pk extension), so that is probably where the terminology came from (not Peak Audio, who make CobraNet).

It sounds like you guys already have this pretty well nailed down, but to summarize: the basic idea is that the peak file is a down-sampled version of the original audio, but the downsampling process takes the maximum absolute value of each block of, say, 128 samples, rather than using a traditional filtering/decimation process. The peak file is used to draw the waveform when you are zoomed out; when you are zoomed in, you only need a small portion of the audio file, so you use the actual audio data.

The peak file is created either during recording or, if you are opening an existing file, when the file is opened. CoolEdit will save the peak files along with the original .wav file so that on subsequent opens, it need not be recalculated. They must somehow deal with the problem of keeping the peak file in sync with the wave file when you copy/paste, change volume, etc. A brute-force method would be to simply rescan the whole file after every change and recreate the peak file anew -- simple, fool-proof, but very inefficient, especially with large files. Smarter algorithms that only update the portion affected are probably used, and the rescanning is probably built into the processing routines so the audio data only needs to be accessed once.

One other subtlety to mention: there are 2 different ways I've seen of dealing with the bi-polar nature of audio. You could store a single maximum absolute value for each block, and then display the waveform as either a simple positive-only envelope or as a symmetrical bi-polar waveform with the absolute maximum value used for both the positive and negative value. A second way would be to store the maximum and minimum (most negative) values for each block, and then draw the bi-polar waveform from that. The first way cuts the storage requirements in half, while the second way is a bit more accurate.

Also, using the 128-sample block example, you don't necessarily need to restrict zooming out to be in 128-sample intervals. The peak file could be interpolated as necessary to create arbitrary views. I think simple "nearest-neighbor" (zero-order) interpolation would be sufficient for this crude display-only waveform.
Reply by ●March 22, 2005
in article 3ab7k9F68uqeeU1@individual.net, Jon Harris at goldentully@hotmail.com wrote on 03/22/2005 13:49:

> One other subtlety to mention, there are 2 different ways I've seen of dealing with the bi-polar nature of audio. You could store a single maximum absolute value for each block, and then display the waveform as either a simple positive-only envelope or as a symmetrical bi-polar waveform with the absolute maximum value used for both the positive and negative value. A second way would be to store the maximum and minimum (most negative) values for each block, and then draw the bi-polar waveform from that. The first way cuts the storage requirements in half, while the second way is a bit more accurate.

i like the second way. for each pixel, you draw a vertical line from the max value to the min value (for the entire block corresponding to that pixel). it works all the way down to 1 sample/pixel. very consistent in behavior.

> Also, using the 128-sample block example, you don't necessarily need to restrict zooming out to be in 128-sample intervals. The peak file could be interpolated as necessary to create arbitrary views.

i think that's bad. each pixel maps to a contiguous segment of samples and you want the max value and the min value for that segment. the peak file will not have the information of exactly where the max and min were, nor if there was another peak that almost hit the max (but lost the contest with the real max). that other peak (which ain't in the peak file) might become the true peak for a remapped segment of audio if you choose the zoom ratio to be any arbitrary view, unless the new view is a multiple of 128/pixel (of which you can figger out the max and min from the peak file).

> I think simple "nearest-neighbor" (zero-order) interpolation would be sufficient for this crude display-only waveform.

i don't think it would look so good.

--

r b-j  rbj@audioimagination.com

"Imagination is more important than knowledge."
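[Editor's note: the drawing rule r b-j endorses -- one vertical line per pixel column, from that column's min to its max -- reduces to mapping a (min, max) sample pair onto display rows. A tiny sketch; the function name, 16-bit full scale, and row convention (row 0 at the top) are my own assumptions.]

```python
def column_extents(col_min, col_max, height, full_scale=32768):
    """Map a (min, max) sample pair to (top_row, bottom_row) display rows.

    Row 0 is the top of the display; the horizontal midline is zero.
    Drawing a vertical line between the returned rows for every pixel
    column gives the familiar waveform envelope, and behaves identically
    all the way down to 1 sample/pixel.
    """
    mid = height // 2
    top = mid - round(col_max * mid / full_scale)     # max goes up
    bottom = mid - round(col_min * mid / full_scale)  # min goes down
    return top, bottom
```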
Reply by ●March 22, 2005
On 21 Mar 2005 16:14:07 -0800, "seijin@gmail.com" <seijin@gmail.com> wrote:

> Actually, it does make sense. But if zooming in, I would have to reanalyze the whole file, wouldn't I? Because the user would want the ability to scroll through the whole file at a fine detail. So maybe they zoom in far enough where 1 pixel is equal to 50 samples - wouldn't I need to reanalyze the whole file by grabbing 50 samples, finding the min & max and then plotting?
>
> I can see that zooming out wouldn't be a problem since you're just pretending that 128 samples is really 1 sample. So they zoom out once and now 256 samples is equal to 1 sample. And instead of re-reading the whole file to get the min and max of 256 samples you'd just get the minimum and maximum of the first two "blocks" of 128 samples, right? And then it should be fine zooming back in as long as they only zoom into a detail of 128 samples/pixel as that should be loaded into memory at that detail.
>
> Am I on the same level?

Yes, this appears to be how most audio editors work: generating some sort of file that's much smaller than the original but that effectively has the envelope, and this peak file is used for fast displays of larger portions of the file. Cool Edit and N-Track Studio (and probably Goldwave, but I don't remember) do all their scanning and generation of the peak file when a file is opened or as it's recorded, and leave the file with the name of the .wav file but with something like a .pk extension.

Also, cdwave(.com) does this scanning, but apparently only uses one file for peak data. If you reload the last file it's much faster, but if you load a different file and then the first file, it takes the 'regular' slow time to scan.

If I were writing something like that, it would be tempting to decode one of the file formats and use it, but there might be some legal problems with doing that. This sort of thing is probably well-documented somewhere, perhaps harmony-central.com.

Ask the guy at cdwave.com how he does it (even though there are only two display sizes). I recall a poster here (comp.dsp) years ago saying he was writing a unix/linux audio editor; maybe you could hunt him down and ask him.

-----
http://mindspring.com/~benbradley
Reply by ●March 22, 2005
On Tue, 22 Mar 2005 10:49:44 -0800, "Jon Harris" <goldentully@hotmail.com> wrote:

> One other subtlety to mention, there are 2 different ways I've seen of dealing with the bi-polar nature of audio. You could store a single maximum absolute value for each block, and then display the waveform as either a simple positive-only envelope or as a symmetrical bi-polar waveform with the absolute maximum value used for both the positive and negative value. A second way would be to store the maximum and minimum (most negative) values for each block, and then draw the bi-polar waveform from that. The first way cuts the storage requirements in half, while the second way is a bit more accurate.

I think it's important to do the second way, as many sounds (especially the most common things recorded, voice and musical instruments) are asymmetrical, and if you do only one half of the waveform, it could be heavily clipped on the other half and you wouldn't know it.

> Also, using the 128-sample block example, you don't necessarily need to restrict zooming out to be in 128-sample intervals. The peak file could be interpolated as necessary to create arbitrary views. I think simple "nearest-neighbor" (zero-order) interpolation would be sufficient for this crude display-only waveform.

I agree, and I'd think this is how they do it.

-----
http://mindspring.com/~benbradley
Reply by ●March 22, 2005
"Ben Bradley" <ben_nospam_bradley@frontiernet.net> wrote in message news:t8r041dr1tcpk8stsk3con5shr3h7mktbh@4ax.com...

> On Tue, 22 Mar 2005 10:49:44 -0800, "Jon Harris" <goldentully@hotmail.com> wrote:
>
> > One other subtlety to mention, there are 2 different ways I've seen of dealing with the bi-polar nature of audio. You could store a single maximum absolute value for each block, and then display the waveform as either a simple positive-only envelope or as a symmetrical bi-polar waveform with the absolute maximum value used for both the positive and negative value. A second way would be to store the maximum and minimum (most negative) values for each block, and then draw the bi-polar waveform from that. The first way cuts the storage requirements in half, while the second way is a bit more accurate.
>
> I think it's important to do the second way, as many sounds (especially the most common things recorded, voice and musical instruments) are asymmetrical, and if you do only one half of the waveform, it could be heavily clipped on the other half and you wouldn't know it.

To do it right, you would store the greater of the positive and negative peaks (i.e. the max of the absolute value). Then you would never have clipping that doesn't show up. But I agree the second method is superior since it provides more information than the first.
Reply by ●March 22, 2005
"robert bristow-johnson" <rbj@audioimagination.com> wrote in message news:BE65D67E.57D3%rbj@audioimagination.com...

> i think that's bad. each pixel maps to a contiguous segment of samples and you want the max value and the min value for that segment. the peak file will not have the information of exactly where the max and min were, nor if there was another peak that almost hit the max (but lost the contest with the real max). ...
>
> i don't think it would look so good.

Well, I know that CoolEdit and others do allow arbitrary zoom settings, and the resulting displays look just fine. But I don't know exactly how they do it. To get around some of the problems, when interpolating between peak values, perhaps the largest one should be used. That will at least make it so you never show a peak value smaller than it actually is. And keep in mind these types of "massively zoomed out" displays are usually just rough pictures of the envelope to help you identify major features. You wouldn't typically be trying to select something as fine as 1 screen pixel when zoomed out like that.

I just did a quick experiment with some really low-frequency sine waves (1-5 Hz) in CoolEdit, and it does look a bit "chunky" when the peak file is being used. You can see when it switches to using the real audio data, as the waveform then becomes pristine.
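[Editor's note: one way to realize Jon's "use the largest value when interpolating" suggestion for an arbitrary zoom ratio is to take the min/max over every peak block that overlaps a pixel's sample range. A displayed peak is then never smaller than the true one, though it may bleed into a neighboring pixel, which matches r b-j's objection. This is a guess at the approach, not how CoolEdit actually does it; names are illustrative and `peaks` is assumed to be an (n_blocks, 2) array of (min, max) per 128-sample block.]

```python
import numpy as np

BLOCK = 128

def pixel_minmax(peaks: np.ndarray, px: int, spp: float):
    """Conservative (min, max) for pixel `px` at `spp` samples/pixel.

    Covers every 128-sample block overlapping the pixel's sample range,
    so no true peak is ever under-reported at an arbitrary zoom ratio.
    """
    first = int(px * spp) // BLOCK              # first overlapping block
    last = (int((px + 1) * spp) - 1) // BLOCK   # last overlapping block
    seg = peaks[first : last + 1]
    return seg[:, 0].min(), seg[:, 1].max()
```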