Forums

Subject: Music & Acoustic Transcription (was: The Holy Grail of Transforms & Component Extraction)

Started by Alfred Einstead May 30, 2013
I will follow up on the note below with some You Tube demos, along
with an outline of the analysis method used to recover the "Holy
Grail", as it were (i.e. clean separation of acoustic signals into
their natural "chirp" components, in the same way that a person
decomposes the sounds when hearing them).

From 2012 Dec 15
http://groups.google.com/group/comp.dsp/msg/fb55797342ea084e
> The hybrid of the time-frequency and time-scale are the S-transforms.
> They have some rather unusual and extreme properties that even the
> research literature doesn't (yet) know of.
> I'll do a quick run-up to that, since I have an interest in it right
> now, because of the above-mentioned "unusual and heretofore unknown
> properties." Among other things, it recovers the concept (well-known
> to physicists) of "instantaneous frequency" and it leads directly to a
> *non-linear* transform that removes the problem of spectral leakage.
(The not-"well-known" property, BTW, was not the "instantaneous frequency" concept, but the fact that the S-transform has a version of the Parseval Theorem; but that's another issue I won't be discussing here.)
> It's better to map the amplitude as brightness and the phase as color.
> Then you'll end up seeing some rather interesting (and revealing)
> patterns. Colorizing the transform shows the first signs of the
> emergence of the Holy Grail that I'm leading up to.
You can see this, for instance, in the following You Tube video, where I transcribe a drum solo at 1/4 speed with a (really, really sloppily coded and slightly modified) version of the S-transform:
http://www.youtube.com/watch?v=6orozX1GD1w

What's modified in the transform is that I normalise the phase of the transform so that the following property holds:

* Tones transform to tones with the same phase and frequency.

The phase is color-coded, and the one thing that stands out is that it is *almost independent* of the frequency at which the transform tunes in. You can see clearly-distinguishable groups, once the phase is brought out in this way.

The thing that particularly stands out is the real-time tracking of the instantaneous frequency, which can be seen directly by counting the number of phase cycles per time unit (the units in the video are 1 pixel = 1/5280 second). The bass drum, for instance, drops down from 85 Hz to 60 Hz, and you can even see the echo resonating in real time.

The separation is generally true for any acoustic or wave phenomenon where separate (possibly variable-frequency) components are added in -- as long as you don't have too many crowded together on the same signal.
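The phase normalization described here can be sketched numerically. The following is my own rough reconstruction of the property (not the poster's code): a discrete Stockwell-style transform built voice-by-voice in the frequency domain, with each voice multiplied by exp(2*pi*i*f*t) so that a pure tone comes out carrying its own phase and frequency in every nearby bin.

```python
import numpy as np

def s_transform_phase_normalized(x, fs):
    """Discrete S-transform with the phase referenced to absolute time,
    so that a pure tone at f0 shows phase ~ 2*pi*f0*t in every nearby
    frequency bin (a reconstruction of the property described above,
    not the poster's code). Rows are frequency bins, columns are time."""
    N = len(x)
    X = np.fft.fft(x)
    alpha = np.fft.fftfreq(N, d=1.0 / fs)   # frequency offsets in Hz
    t = np.arange(N) / fs
    S = np.zeros((N // 2, N), dtype=complex)
    for k in range(1, N // 2):              # skip DC (window undefined at f=0)
        f = k * fs / N                      # bin frequency in Hz
        # Gaussian voice window in the frequency domain, width scaling with f
        window = np.exp(-2.0 * np.pi**2 * alpha**2 / f**2)
        voice = np.fft.ifft(np.roll(X, -k) * window)
        # phase normalization: re-reference the phase of this voice
        S[k] = voice * np.exp(2j * np.pi * f * t)
    return S
```

With a 50 Hz test tone, differentiating the unwrapped phase of any bin near 50 Hz recovers the instantaneous frequency, and the phase at a given instant agrees across neighboring bins -- the "almost independent of the frequency the transform tunes in at" behavior.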
> Finally, this leads up to the Holy Grail. Since the natural frequency
> of this component is n(q, p), then it is just as natural to redraw the
> spectrograph by moving this amplitude up from frequency p to frequency n.
This is seen in the second You Tube video:
http://www.youtube.com/watch?v=itUSUau6DJM
which shows the same drum solo, first at 100% speed, then at 25% speed with the frequencies localized -- and then does the same thing for a 15-second segment of the "2001 theme" (at 1/4 speed, for 1 minute), along with a House music segment that has both voice and electronica in it.

The one thing that stands out the most is that the spectrum is SINGULAR. It is virtually all concentrated on a web of "chirp lines". The only reason you see the extra smoky residue of lines is that I enhanced the brightness from the raw readout to better show the intensities. The original was virtually all black, except on the chirp lines themselves.

The phase color-coding is averaged for the higher frequencies, so it shows up as white. The averaging is done by taking the brightness equal to the time-average amplitude, and the color saturation equal to the average coherence of the signal over the time interval. Low frequencies are coherent, while high frequencies show up as grey-shades.

The coding is sloppy, because it is only meant to be a proof of concept. In particular, different phase-estimating methods were used in the different segments as the basis for determining the instantaneous frequency. For the House Music segment, for instance, you see quantization of the "chirp lines", which is an artifact of the fact that I used a time step in estimating the phase time derivative.

(And, of course, I can't let the house music demo go without a shameless plug for some of the experimentation I've been doing with spectrographic-based remixing and combined biological-machine voice synthesis :) The Beast Stomp: https://www.youtube.com/watch?v=FrnuJp9eoRw )

The ideal form of the analysis that results in the clean separation can be done as follows. The "ideal" part of the analysis is the color coding of the phase.
A method is required to both do and undo the phase-averaging that is necessary when going beyond the resolution of the graph.

(a) First, get a rough separation of frequencies (e.g. the first You Tube video), so as to separate out the components at least to some degree. Any transform may be used, as long as it is approximately scale-invariant and preserves the phase of monochromatic tones. The forward transform converts a signal X into a complex spectrum Y(q,p) parametrized by time (q) and frequency (p).

(b) Second, localize the spectrum to its instantaneous frequency, which is determined by the phase (e.g. the second You Tube video). The IF (nu) is the time derivative of the phase, determined for each frequency bin (p) at each time (q). One way to estimate it is to multiply the transform by the frequency before carrying out the transform, to obtain the time derivative directly. That's something I haven't yet tested.

(c) Add the complex amplitude Y(q,p) to Y_loc(q,nu(q,p)). So, the contribution that originally went into the p-bin gets moved into the nu(q,p)-bin. All this is added up.

(d) Perform an inverse transform on Y_loc on a per-frequency basis, for each nu, to obtain a set of voices X_nu for each frequency nu. The result is a set of voices that are displayed concurrently on separate tracks to make a spectrogram (or scalogram). The frequencies for the "voices" can be freely chosen, provided the right normalizations are used -- in particular, so as to make property (e) valid.

(e) The original signal is simply the sum X = sum_nu X_nu.

This is where the "ideal" part of the analysis enters the picture. To get an exact reproduction requires that the exact phase be retained in the graph itself, or represented by some other (useable) means. Otherwise, phase estimation is going to have to be carried out. The primary issue here is that the chirp lines may cross bins, while you want the phase to remain continuous on each chirp line.
This requires using some kind of left-to-right algorithm to come up with the best estimate for the phase in each bin, based on the phases (and intensities) of the nearest frequency bins in the immediately preceding time step.
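A crude version of steps (b)-(c) -- estimating the instantaneous frequency as the wrapped finite-difference phase derivative and reassigning each cell's complex amplitude to that frequency's bin -- might look like the following sketch (my own naming throughout, not the poster's code; the bin quantization artifact mentioned for the House segment is exactly what the discrete time step here produces):

```python
import numpy as np

def localize_spectrum(S, fs):
    """Reassign each (freq bin p, time q) cell of a phase-normalized
    complex spectrum S to the bin of its instantaneous frequency
    nu(q, p), estimated from the phase increment between adjacent
    time samples. Sketch of steps (b)-(c) above."""
    K, N = S.shape
    S_loc = np.zeros_like(S)
    # wrapped phase increment per sample, in (-pi, pi]
    dphi = np.angle(S[:, 1:] * np.conj(S[:, :-1]))
    nu = dphi * fs / (2.0 * np.pi)      # instantaneous frequency, Hz
    df = fs / (2.0 * K)                 # Hz per bin (rows span 0..fs/2)
    for p in range(K):
        for q in range(N - 1):
            b = int(round(nu[p, q] / df))
            if 0 < b < K:
                S_loc[b, q] += S[p, q]  # move the p-bin content to the nu-bin
    return S_loc
```

Fed a tone smeared across several adjacent bins, all of that energy collapses back onto the single bin of the tone's instantaneous frequency, which is the "singular" chirp-line concentration seen in the second video.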
Pretty pictures.  Have you transcribed anything with it yet?
On May 30, 3:36 pm, Alfred Einstead <federation2...@netzero.com> wrote:
> I will follow up on the note below with some You Tube demos, along
> with an outline of the analysis method used to recover the "Holy
> Grail", as it were (i.e. clean separation of acoustic signals into
> their natural "chirp" components, in the same way that a person
> decomposes the sounds when hearing them).
>
> From 2012 Dec 15
> http://groups.google.com/group/comp.dsp/msg/fb55797342ea084e
>
> > The hybrid of the time-frequency and time-scale are the S-transforms.
> > They have some rather unusual and extreme properties that even the
> > research literature doesn't (yet) know of.
okay, on that post you (if you're the same as Mark) said:
> The S-transform fixes the problem with g. In its absolutely most
> general form it is defined by
>
>     f_S(q, p) = integral f(t) (|p| g(p(t - q)))* dt
>
> and has inverse
>
>     f(t) = integral f_S(q, p) 1^{p(t - q)} dq dp
not sure what the 1^some_power means. q is delay (or shift in t) and p is scale. g(t) is a normalized window kernel. how is this different from a continuous wavelet transform?

r b-j
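For what it's worth, the quoted forward transform can be evaluated numerically as written. The kernel below is purely my illustration: a unit-area Gaussian window carrying a unit-frequency tone, so that p acts as a frequency rather than a bare scale (the thread's question about the `1^{...}` notation in the inverse is left open here). Note the |p| normalization, where the usual L2-normalized CWT would carry 1/sqrt(|p|); with |p|, a tone's amplitude comes out independent of p.

```python
import numpy as np

def forward_s(x, t, q_grid, p_grid):
    """Direct numerical version of the quoted forward transform
        f_S(q, p) = integral f(t) (|p| g(p(t - q)))* dt
    with the assumed kernel
        g(u) = exp(-u**2/2)/sqrt(2*pi) * exp(2j*pi*u)
    (a unit-area Gaussian window times a unit-frequency tone; this
    particular g is an illustration, not necessarily the poster's)."""
    dt = t[1] - t[0]
    S = np.zeros((len(p_grid), len(q_grid)), dtype=complex)
    for i, p in enumerate(p_grid):
        for j, q in enumerate(q_grid):
            u = p * (t - q)
            g = np.exp(-u**2 / 2.0) / np.sqrt(2.0 * np.pi) * np.exp(2j * np.pi * u)
            # conjugate of the whole analyzing kernel, as in the quoted formula
            S[i, j] = np.sum(x * np.conj(np.abs(p) * g)) * dt
    return S
```

Fed a pure complex tone at f0, |f_S(q, p)| peaks at p = f0 with magnitude equal to the tone's amplitude, which is one concrete way the S-transform's normalization differs from a CWT convention.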