I've been assigned a project, part of which pushes dangerously at my envelope as a commercial software designer, and I'd be grateful for some comments on the following scenario.

There's a real-time audio feed (good phone quality) which needs to be digitised, frequency-analysed, the bin values processed in some yet-to-be-specified manner, and then resynthesised. The result needs to be indistinguishable from the original, apart from the effects of the processing.

The questions that occur to me are:
1. Should 44.1 kHz 16-bit sampling suffice?
2. How many frequency bins are indicated?
3. What kind of delay should be expected (ignoring the unspecified processing stage)?

Are there any tools out there which might allow me to play around with this stuff and get a feel for different parameters?

Thanks,
Matti
Novice question on frequency analysis & resynthesis
Started by ●April 28, 2004
Reply by ●April 28, 2004
Matti Lamprhey wrote:

> I've been assigned a project, part of which pushes dangerously at my
> envelope as a commercial software designer, and I'd be grateful for some
> comments on the following scenario.

> There's a real-time audio feed (good phone quality) which needs to be
> digitised, frequency-analysed, the bin values processed in some
> yet-to-be-specified manner, and then resynthesised. The result needs to
> be indistinguishable from the original, apart from the effects of the
> processing.

> The questions that occur to me are:
> 1. Should 44.1KHz 16-bit sampling suffice
> 2. How many frequency bins are indicated
> 3. What kind of delay should be expected (ignoring the unspecified
> processing stage)

My first thought, after all the Fourier transform discussion, was to use FFT, but that doesn't work for a real time signal, and in any case may not be what you want.

Digitized phone is 8 kHz, 16 bit compressed to 8 using mu-law or A-law coding. In any case, 44.1 kHz 16 bit should be plenty.

The number of bins should depend on what type of processing you want to do. To me, it looks something like a graphic equalizer for a stereo system where you separate the signal by frequency bands, adjust the amplitude of each, and then combine the result. If you are careful with the math, it should come out just like the original in the default case of no processing.

In the two band case, you take the original signal and run it through a low pass filter. For the high band, subtract the low band result from the appropriately delayed original signal, which guarantees that the sum will be the original signal again.

The number of bands (bins) depends on what processing you want to do, and how much compute power you have to do it. The delay depends on the sharpness of the filters, again dependent on processing power.

I will see what others have to say about this problem.

-- glen
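As a rough illustration of the two-band scheme described above, here is a minimal Python sketch. The moving-average low-pass, the tap count, and the name `split_two_bands` are assumptions chosen for brevity; nothing in the thread specifies a particular filter. An odd tap count is used so that the filter delay is a whole number of samples.

```python
import math


def split_two_bands(x, taps=9):
    """Split x into a low band and a complementary high band.

    The low band is a causal moving average (a crude linear-phase FIR
    low-pass, assumed here only for illustration). The high band is the
    original signal, delayed by the filter's group delay, minus the low band.
    """
    if taps % 2 == 0:
        raise ValueError("taps must be odd so the delay is a whole number of samples")
    delay = (taps - 1) // 2  # group delay of the moving average, in samples
    low, high = [], []
    for n in range(len(x)):
        # Samples before the start of the signal are treated as zero.
        lo = sum(x[n - k] for k in range(taps) if n - k >= 0) / taps
        delayed = x[n - delay] if n - delay >= 0 else 0.0
        low.append(lo)
        high.append(delayed - lo)
    return low, high, delay


# A slow component plus a fast one, standing in for real audio.
x = [math.sin(0.05 * n) + 0.3 * math.sin(2.5 * n) for n in range(64)]
low, high, delay = split_two_bands(x)

# With no per-band processing, the bands sum back to the delayed input.
recombined = [l + h for l, h in zip(low, high)]
```

Scaling `low` or `high` before summing gives a two-band equalizer; with unit gains the output is simply the input delayed by `delay` samples, which is the "default case of no processing".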
Reply by ●April 28, 2004
FFT can be used in "almost-real-time". It depends on what latency is acceptable, and on what processing is required. A streaming phase vocoder incurs a latency of one FFT frame plus an extra bit for the frame overlap (a fraction of one frame). If you only need to do amplitude modifications (with respect to frequency), the frames can be kept relatively small, and hence also the overlap.

For frequency-based modifications such as pitch scaling, the FFT size generally needs to be 1024 at the 44100 Hz sample rate, and the overlap at least fourfold, depending partly on how big a shift you want to make to the frequencies, and to what extent you need to track low or time-varying frequency components. This leads to a latency around the 25 msec mark. Smaller frame sizes can take this down to 6 msecs or so, and lower sample rates will take it down further pro rata. It is a fairly direct quality/CPU load tradeoff. I got a basic identity-process phase vocoder running at 16 kHz srate on the original SHARC Ez-Kit, with 50% overlap, without glitches.

You can try out a streaming phase vocoder in real time by running Csound (open-source synthesis and processing program for music):

http://www.csounds.com/

Or if you are on Windows you could download the demo of "Project 5" by Cakewalk (www.cakewalk.com) and try out the "Spectral Transformer" plugin by yours truly, though the transformations will probably be too weird for your needs (?).

Of course, this may be way OTT for your needs...

Richard Dobson
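The latency figures quoted above can be checked with a little arithmetic. The sketch below assumes a simple model in which the buffering delay is one frame plus one hop (the frame length divided by the overlap factor); the function name `buffering_delay_ms` and that model are illustrative assumptions, and a real system adds converter and driver delays on top.

```python
def buffering_delay_ms(fft_size, srate, overlap_factor=4):
    """Return (frame_ms, hop_ms): the time to fill one FFT frame, and the
    time between successive frames, both in milliseconds.

    Assumed model: total buffering delay is roughly frame_ms + hop_ms.
    """
    hop = fft_size // overlap_factor
    frame_ms = 1000.0 * fft_size / srate
    hop_ms = 1000.0 * hop / srate
    return frame_ms, hop_ms


for fft_size in (1024, 256):
    frame_ms, hop_ms = buffering_delay_ms(fft_size, 44100)
    print(f"FFT {fft_size:4d} @ 44100 Hz: frame {frame_ms:.2f} ms, hop {hop_ms:.2f} ms")
```

At 44100 Hz a 1024-sample frame takes about 23.2 ms to fill and a 256-sample frame about 5.8 ms, which is in line with the "around 25 msec" and "6 msecs or so" figures.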
Reply by ●April 29, 2004
# glen herrmannsfeldt

> My first thought, after all the Fourier transform discussion, was
> to use FFT, but that doesn't work for a real time signal,

Isn't realtime FFT possible if you are happy to delay the signal chain and buffer up enough audio to window?

--
Toby
asktoby.com
BSOD VST & ME
Reply by ●April 29, 2004
"Richard Dobson" <richarddobson@blueyonder.co.uk> wrote...> FFT can be used in "almost-real-time". It depends on what latency is > acceptable, and on what processing is required. A streaming phase > vocoder incurs a latency of one FFT frame plus an extra bit for the > frame overlap (fraction of one frame). If you only need to do > amplitude modifications (with respect to frequency) the frames can > be kept relatively small, and hence also the overlap. > For frequency-based modifications sutch as pitch scaling, the FFT size > generally needs to be 1024 at the 44100Hz sample rate, and the > overlap at least fourfold; depending partly on how big a shift you > want to make to the frequencies, and to what extent you need to track > low or time-varying frequency components. This leads to a latency > around the 25msec mark. Smaller frame sizes can take this down to > 6msecs or so, and lower sample rates will take it down further pro > rata. It is a fairly direct quality/CPU load tradeoff. I got a > basic identity-process phase vocoder running at 16KHz srate on the > original SHARC Ez-Kit, with 50% overlap, without glitches. > > You can try out a streaming phase vocoder in real-time by running > Csound (open-source synthesis and processing program for music): > > http://www.csounds.com/ > > Or if you are on Windows you could download the demo of "Project 5" by > Cakewalk (www.cakewalk.com) and try out the "Spectral Transformer" > plugin by yours truly; though the transformations will probably be too > weird for your needs (?). > > Of course, this may be way OTT for your needs...These responses have been extremely helpful, and I'll examine both resources you list here, Richard. I'm told that the processing that's required can be described as "amplitude-only" with little or no need to track things between frames or across bins. I'd like to check my assumptions on a few things here, if I may: 1. Is 500 bins likely to be the right order of magnitude? 
(Or should I be expecting 50, or 5000 perhaps?) 2. Are the bins a fixed width, or does their width vary in direct proportion to their midpoint frequency (as seems intuitively more appropriate to me)? 3. Is the data presented to the processing stage by the analysis stage simply a set of bin amplitudes at each sampling event or frame? 4. If (assuming no significant delay imposed by the unspecified processing stage) I need to aim at a maximum of 50ms delay overall, should this be feasible using readily-available and low-price hardware? I'm guessing from what you said above that this shouldn't be a problem. Matti
Reply by ●April 29, 2004
Matti Lamprhey wrote:
....
> I'm told that the processing that's required can be described as
> "amplitude-only" with little or no need to track things between frames
> or across bins.

This begs the question - do you actually need the FFT? Are you ~required~ to use it, or do you have control over the choice of solution? Really hard to make suggestions without knowing the details! Is this an industrial/commercial project, or an academic/student one? From what you write, you have been told the solution but not the problem!

> I'd like to check my assumptions on a few things here, if I may:
>
> 1. Is 500 bins likely to be the right order of magnitude? (Or should I
> be expecting 50, or 5000 perhaps?)

What's the lowest frequency you need to process? Or, what is the finest frequency resolution you need? A 1024-sample FFT will get converted to a pvoc analysis frame of (1024/2)+1 bins comprising amp/freq (or amp/phase) pairs, from DC to Nyquist. That's a useful size for general processing (assuming a standard srate such as 44100), but can be too small for demanding frequency modification; you may even be able to get away with a smaller FFT size if you don't need a fine frequency resolution. You can define the "fundamental frequency" of FFT analysis as srate/FFTsize. The smaller the FFT, the less accurate the processing of low frequencies will be.

> 2. Are the bins a fixed width, or does their width vary in direct
> proportion to their midpoint frequency (as seems intuitively more
> appropriate to me)?

The FFT defines a linear frequency spacing, so pvoc does too. All the usual FFT interpretative issues (spectral leakage etc) apply equally to a pvoc frame. There is a "log-FFT", which I have never used, and about which I know nothing, other than that it implements a log frequency spacing.

> 3. Is the data presented to the processing stage by the analysis stage
> simply a set of bin amplitudes at each sampling event or frame?

See above. Pvoc processing uses overlapping analysis frames; the overlap is typically some integer fraction of one frame. E.g. with FFT=1024, a new frame would be calculated every 256 samples. The minimum you can get away with is 50% overlap (using a Hamming window). Hence, we can refer to the FFT "analysis rate", which is srate / overlap (taking "overlap" here as the number of samples between successive frames, 256 in the example above). The higher the rate, the better pvoc tracks varying frequency components and transients. The ultimate is the "sliding FFT" which updates every sample. There is an example implementation for a Texas DSP chip somewhere on the net. No phase vocoder has been developed to date (AFAIK) using the sliding FFT - working on it!

> 4. If (assuming no significant delay imposed by the unspecified
> processing stage) I need to aim at a maximum of 50ms delay overall,
> should this be feasible using readily-available and low-price hardware?
> I'm guessing from what you said above that this shouldn't be a problem.

The delay is not really a factor of the hardware other than indirectly through the choice of FFT size and overlap, which affect CPU load. I would guess that a full pvoc would be tight on a DSP at less than 100 MHz clock; my main experience is on general purpose computers (Pentium, PowerPC). My first real-time pvoc ran comfortably on a Pentium II 333 MHz. 50 ms delay is quite high, and enough to be disturbing for speech, and it would be realistic to aim for a lower value. If your processing task is very moderate, attending to only a narrow range of frequencies, you may be able to save a lot of CPU by working directly with the complex FFT data, and not doing the expensive conversion to amplitude/phase for each bin.

It is difficult for me to guess what you might regard as "readily available" or "low-price"!

Richard Dobson
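The numbers in this reply (bin count, bin spacing, analysis rate) follow directly from the FFT size, sample rate, and hop. A small Python sketch, with the function name `analysis_grid` and its argument names invented for illustration:

```python
def analysis_grid(fft_size, srate, hop):
    """Return the basic quantities of an FFT analysis grid.

    num_bins          : bins from DC to Nyquist inclusive, fft_size/2 + 1
    bin_spacing_hz    : linear spacing between bin centres, srate / fft_size
    frames_per_second : the "analysis rate", srate / hop
    centres_hz        : centre frequency of every bin
    """
    num_bins = fft_size // 2 + 1
    bin_spacing_hz = srate / fft_size
    frames_per_second = srate / hop
    centres_hz = [k * bin_spacing_hz for k in range(num_bins)]
    return num_bins, bin_spacing_hz, frames_per_second, centres_hz


num_bins, spacing, rate, centres = analysis_grid(fft_size=1024, srate=44100, hop=256)
print(num_bins, round(spacing, 2), round(rate, 1), centres[-1])
```

For a 1024-point FFT at 44100 Hz this gives 513 equally spaced bins about 43 Hz apart, so "500 bins" is indeed the right order of magnitude for that frame size. Halving the FFT size halves the bin count and doubles the spacing.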
Reply by ●April 29, 2004
"Richard Dobson" <richarddobson@blueyonder.co.uk> wrote...> Matti Lamprhey wrote: > .... > > I'm told that the processing that's required can be described as > > "amplitude-only" with little or no need to track things between > > frames or across bins. > > This begs the question - do you actually need the FFT? Are you > ~required~ to use it, or do you have control over the choice of > solution? Really hard to make suggestions without knowing the > details! Is this an industrial/commercial project, or an academic/ > student one? From what you write, you have been told the solution > but not the problem!Sorry -- this is due to my general ignorance of the field. It's a commercial project, and I need to analyse the audio stream into frequency components at a resolution sufficiently fine that resynthesis produces a stream indistinguishable from the original. I was assuming that FFT was the most appropriate mechanism, but it's not a requirement.> > > > I'd like to check my assumptions on a few things here, if I may: > > > > 1. Is 500 bins likely to be the right order of magnitude? (Or > > should I be expecting 50, or 5000 perhaps?) > > > > What's the lowest frequency you need to process? Or, what is the > finest frequency resolution you need? A 1024-sample FFT will get > converted to a pvoc analysis frame of (1024/2)+1 bins comprising > amp/freq (or amp/phase) pairs, from DC to Nyquist. That's a useful > size for general processing (assuming a standard srate such as 44100), > but can be too small for demanding frequency modification; you may > even be able to get away with a smaller FFT size if you don't need a > fine frequency resolution. You can define the "fundamental frequency" > of FFT analysis as srate/FFTsize. The smaller the FFT, the less > accurate the processing of low frequencies will be.The frequency range is that implied by "good phone quality"; human speech, primarily.> > > 2. 
Are the bins a fixed width, or does their width vary in direct > > proportion to their midpoint frequency (as seems intuitively more > > appropriate to me)? > > > The FFT defines a linear frequency spacing, so pvoc does too. All the > usual FFT interpretative issues (spectral leakage etc) apply equally > to a pvoc frame. There is a "log-FFT", which I have never used, and > about which I know nothing, other than that it implements a log > frequency spacing. > > > > 3. Is the data presented to the processing stage by the analysis > > stage simply a set of bin amplitudes at each sampling event or > > frame? > > > > See above. Pvoc processing uses overlapping analysis frames; the > overlap is typically some integer fraction of one frame. E.g. with > FFT=1024, a new frame would be calculated every 256 samples. The > minimum you can get away with is 50% overlap (using a Hamming > window). Hence, we can refer to the FFT "analysis rate", which is > srate / overlap. The higher the rate the better pvoc tracks varying > frequency components and transients. The ultimate is the "sliding FFT" > which updates every sample. There is an example implementation for a > Texas DSP chip somewhere on the net. No phase vocoder has been > developed to date (AFAIK) using the sliding FFT - working on it! > > > > 4. If (assuming no significant delay imposed by the unspecified > > processing stage) I need to aim at a maximum of 50ms delay overall, > > should this be feasible using readily-available and low-price > > hardware? I'm guessing from what you said above that this shouldn't > > be a problem. > > The delay is not really a factor of the hardware other than > indirectly through the choice of FFT size and overlap, which affect > CPU load. I would guess that a full pvoc would be tight on a DSP at > less than 100MHz clock; my main experience is on general purpose > computers (pentium, PowerPC). My first real-time pvoc ran comfortably > on a Pentium II 333MHz. 
50ms delay is quite high, and enough to be > disturbing for speech, and it would be realistic to aim for a lower > value. If your processing task is very moderate, attending to only a > narrow range of frequencies, you may be able to save a lot of CPU by > working directly with the complex FFT data, and not doing the > expensive conversion to amplitude/phase for each bin.That's an interesting observation, but I don't know enough about the "complex FFT data" or the precise processing requirement to have an answer yet. I do know that I need to apply the processing across the whole of the relevant frequency range, however.> > It is difficult for me to guess what you might regard as "readily > available" or "low-price"!Yes -- I gather the intention is to bundle this up into a device costing less than 100GBP, if that's a guide. Thanks for your patience! I have tried searching for a dummy's guide to frequency analysis just to orientate myself a little, but haven't yet found anything at the right level... Matti
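For readers unsure what "working directly with the complex FFT data" means: each bin is a complex number, and multiplying it by a real gain changes its amplitude while leaving its phase alone, so no conversion to amplitude/phase is needed for amplitude-only processing. The sketch below uses a deliberately slow, pure-Python DFT so that it runs without any libraries; a real system would use an FFT routine. The helper names and the test signal are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft(); returns the real part, assuming a real signal."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def apply_gains(X, gains):
    """Scale each complex bin by a real gain.

    gains holds one value per bin from DC to Nyquist (n/2 + 1 values); the
    same gain is applied to the mirrored negative-frequency bin so that the
    resynthesised signal stays real.
    """
    n = len(X)
    return [X[k] * (gains[k] if k <= n // 2 else gains[n - k]) for k in range(n)]


n = 16
# Two components: bin 2 (with a phase offset) and bin 5.
x = [math.cos(2 * math.pi * 2 * t / n + 0.7) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
     for t in range(n)]

gains = [1.0] * (n // 2 + 1)
gains[2] = 0.5   # halve the amplitude of bin 2
gains[5] = 0.0   # remove bin 5 entirely

X = dft(x)
Y = apply_gains(X, gains)
y = idft(Y)
```

After this, `y` contains only the bin-2 component at half its original amplitude and with its original phase offset of 0.7 radians.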
Reply by ●April 29, 2004
Toby Newman wrote:

I wrote:

>> My first thought, after all the Fourier transform discussion, was
>> to use FFT, but that doesn't work for a real time signal,

> Isn't realtime FFT possible if you are happy to delay the signal chain and
> buffer up enough audio to window?

It depends on the definition of Fourier transform that you use.

The integral from negative infinity to positive infinity is hard to do in real time.

The piecewise transform can be done, but it isn't the same thing.

-- glen
Reply by ●April 29, 2004
glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:

> Toby Newman wrote:
>
> I wrote:
>
> >>My first thought, after all the Fourier transform discussion, was
> >>to use FFT, but that doesn't work for a real time signal,
>
> > Isn't realtime FFT possible if you are happy to delay the signal chain and
> > buffer up enough audio to window?
>
> It depends on the definition of Fourier transform that you use.
>
> The integral from negative infinity to positive infinity is hard
> to do in real time.
>
> The piecewise transform can be done, but it isn't the same thing.

Most definitions of the FFT that I know of assume a finite window.

Ciao,

Peter K.

--
Peter J. Kootsookos

"I will ignore all ideas for new works [..], the invention of which has
reached its limits and for whose improvement I see no further hope."

- Julius Frontinus, c. AD 84
Reply by ●April 30, 2004
OK, "good phone quality" is really pretty far away from hi-fi quality, so you may not need a large FFT window, and you may not need a sample rate above 16 kHz, say. Indeed, you may not need to use FFT methods at all. It all depends on ~exactly~ what the processing task is, which you still have not described.

In any case, in the absence of any modification, the pvoc process will always generate identical output, in that a plain FFT/IFFT combo does that. This is true regardless of the size of the FFT - it is an "identity transform". The issue is then entirely one of what precision you need for the stated problem. The smaller the FFT, the coarser the resolution of frequency components, but if you are, say, only interested in the broader picture, such as the spectral envelope or the location of formants, a smaller FFT might be OK. But if you want to capture transients and modify them, you need more precision, etc. So, do you need to discriminate frequency components 20 Hz apart? 50 Hz? 100 Hz?

Basically, and setting aside its role as a pure analysis tool, the FFT is a fast way to employ lots of very simple filters. You may be better off with a small number of clever filters. Impossible to say without a more precise description of the task. If you are not allowed to reveal this for commercial reasons, there isn't really much more that can be said.

Richard Dobson
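The "identity transform" property can be demonstrated in a few lines. The sketch below windows overlapping frames, transforms each one, inverts it unchanged, and overlap-adds the results. It assumes a periodic Hann window at 50% overlap (chosen because each pair of overlapping window values sums exactly to 1, rather than the Hamming window mentioned earlier in the thread) and a naive pure-Python DFT in place of a real FFT; all names are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft(); returns the real part, assuming a real signal."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def analyse_resynthesise(x, n=16):
    """Windowed FFT/IFFT with 50% overlap-add and no bin modification."""
    hop = n // 2
    # Periodic Hann window: window[i] + window[i + hop] == 1 for every i.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    out = [0.0] * len(x)
    for start in range(0, len(x) - n + 1, hop):
        frame = [x[start + i] * window[i] for i in range(n)]
        spectrum = dft(frame)      # any per-bin processing would go here
        resynth = idft(spectrum)
        for i in range(n):
            out[start + i] += resynth[i]
    return out


x = [math.sin(0.3 * t) + 0.25 * math.sin(1.9 * t) for t in range(64)]
out = analyse_resynthesise(x)

# Largest reconstruction error away from the edges, where two frames overlap.
max_err = max(abs(out[i] - x[i]) for i in range(8, 56))
```

Away from the first and last half-frame, where only one frame contributes, `out` matches `x` to within floating-point rounding, whatever frame size is chosen. Frame size therefore affects how finely the processing stage can discriminate frequencies, not whether the unmodified signal is recovered.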






