Some Notes on Strategies for Automated Music Transcription.

Started by Mark June 13, 2012
I'm looking through the net for papers, references, and other writeups
on the topic and am not finding a whole lot -- at least not a lot
that's particularly insightful.

The key feature of the filter is that its result should be
reproducible: it should be able to play back what it transcribes.

What I'm looking for is some way to organize the analysis around a
large digital sample archive (freely available, downloaded from a
world class symphony orchestra). The most important thing I've ruled
out, based on a preliminary analysis, is the use of any generic scale-
invariant method. The actual instruments sampled do not transform
covariantly under a change of pitch (i.e. the quality of the sound
changes at different pitches). Consequently, the naive wavelet-based
approach is ruled out, along with anything similar. In addition,
sampling based on a discretization of the time-scale plane (i.e. with
linear time, but frequency on a logarithmic scale) is out.

Instead, I've decided to indiscriminately throw in the different
pitches contained in the archive and treat them as separate templates.
On playback, however, rescaling can be used to derive "blue note"
pitches, or to derive the minor alterations required to tune to
different music scales.
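
As a rough sketch of that playback rescaling (my own illustration, not
anything taken from the archive): assume a template is stored as a
NumPy array of samples and that a pitch shift is done by plain
resampling with linear interpolation. The helper names cents_to_ratio
and rescale_pitch, the 440 Hz test tone, and the 50 cent offset are all
hypothetical.

```python
import numpy as np

def cents_to_ratio(cents):
    """Frequency ratio for a pitch offset in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)

def rescale_pitch(template, cents):
    """Resample a template so that it plays back shifted by `cents`.

    Hypothetical helper: plain linear-interpolation resampling, so the
    duration changes by the inverse of the pitch ratio.
    """
    ratio = cents_to_ratio(cents)
    n_out = int(round(len(template) / ratio))
    # Read the original waveform at positions spaced `ratio` samples apart.
    positions = np.arange(n_out) * ratio
    return np.interp(positions, np.arange(len(template)), template)

# Illustration only: a one-second 440 Hz tone lowered by a quarter tone.
rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440.0 * t)
shifted = rescale_pitch(tone, -50.0)
```

A shift done this way also stretches or shortens the template, so it
only suits small alterations of a steady-state tone.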

Percussion is handled separately, possibly by coding the full duration
of a hit rather than a steady-state run (after taking into account the
quasi-melodic instruments, like the timpani and triangle, and
separating out the individual beats of instruments like the snare
drum).

A review of the literature brings up the notion of a frame, e.g. in
the sense of Kaiser (the somewhat misnamed "A Friendly Guide to
Wavelets", 1994, Birkhäuser), which substantially generalizes
Daubechies (Chapter 3, 1992, "Ten Lectures on Wavelets", Rutgers /
AT&T).

A Hilbert space frame provides a "signal space" H and a "transcription
space" M (which is a measure space, with measure mu). The
"transcription operator" T: H -> L^2(mu) produces a measurable function
for each Hilbert space vector, and a "synthesis operator"
S: L^2(mu) -> H produces a playback. For example, for windowed Fourier
transforms, the function g_qp(x) = g(x - q) exp(2pi i px) yields an
element g_qp of the Hilbert space H = L^2(R) parametrized by (q,p) in
M = R^2.

However, Kaiser's focus is on coding applications, where one wants
complete reproducibility. Consequently, Kaiser has ST = 1_H: H -> H,
while TS: L^2(mu) -> L^2(mu) is a projection operator (ergo
TS = S* T*, where ()* denotes adjoint). The analysis is built around
the property that G = T* T be a strictly positive bounded operator,
bounded away from 0. This is the "frame condition". As is well known in
the literature, one can have frames satisfying this condition even with
M replaced by a discrete subset of M, as long as the density of the
sample points (q,p) is at or above a certain critical density.

A little thought, though, makes it clear that the frame condition is
what you do NOT want for a feature extraction application. In effect,
with this kind of filter, if all you did was (to use an extreme case)
code the violins, and if the frame condition held, you'd end up getting
a filter that follows the aphorism "when all you have is a hammer, then
everything starts to look like a nail", and your trumpets will be
played back like violins. That may work, and you may end up getting
superpositions of violins to sound like trumpets, but the transcription
will end up recording the sounds as violins being played, which misses
the whole point of the exercise.

The whole point of "extraction" is that you should leave something
behind; i.e. there should be allowance for the Sounds of Silence
(sounds that leave behind no transcription). So, ST has to have a
non-trivial kernel. Those Sounds of Silence are the "other" sounds that
are not supposed to be the violins.

So, basically, I redid the Kaiser definition by posing only the
properties that (a) S and T are generalized inverses -- STS = S and
TST = T; and (b) ST and TS are both self-adjoint -- ST = T* S* and
TS = S* T*.

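Conditions (a) and (b) together are the four conditions that
characterize the Moore-Penrose pseudoinverse, so a finite-dimensional
toy version can be checked numerically. The sketch below is only an
illustration under assumptions of mine: H is replaced by R^8, the
templates are random columns of a matrix S (with one deliberately
redundant column), and T is taken to be numpy.linalg.pinv(S).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 8                                   # toy signal dimension (assumption)
B = rng.standard_normal((n, 3))         # three independent toy templates
# Add a redundant template so that TS is a non-trivial projection too.
S = np.column_stack([B, B[:, 0] + B[:, 1]])   # synthesis: coeffs -> signal
T = np.linalg.pinv(S, rcond=1e-10)            # transcription: signal -> coeffs

Q = S @ T   # projection in signal space, onto the span K of the templates
P = T @ S   # projection in transcription space, onto the subspace F

# (a) S and T are generalized inverses of each other.
cond_a = np.allclose(S @ T @ S, S) and np.allclose(T @ S @ T, T)
# (b) ST and TS are self-adjoint (symmetric, for real matrices).
cond_b = np.allclose(Q, Q.T) and np.allclose(P, P.T)

# A "Sound of Silence": the part of a signal orthogonal to every template.
x = rng.standard_normal(n)
silent = x - Q @ x
silent_has_no_transcription = np.allclose(T @ silent, 0.0)

# A "transcription" that plays back as silence: a vector in the kernel of S.
c = np.array([1.0, 1.0, 0.0, -1.0])     # template 1 + template 2 - redundant one
plays_back_as_silence = np.allclose(S @ c, 0.0)

print(cond_a, cond_b, silent_has_no_transcription, plays_back_as_silence)
```

Here Q has rank 3 in an 8-dimensional space, so most signals leave a
residue behind, which is the behaviour the frame condition would rule
out.
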
So, the signal space H decomposes into a part K which has
transcriptions and a kernel H' which contains all the Sounds of
Silence, while the transcription space L^2(mu) continues to have (as it
did with Kaiser) the subspace F of transcriptions. The orthogonal
complement of F (as with Kaiser's frames) is the set of
"transcriptions" that play back as silence: they're the kernel of the S
operator and of the projection operator P = TS. So, what I'm doing is
replacing Kaiser's resolution of identity ST = 1_H with the projection
operator Q = ST onto the K-subspace.

So, now with that out of the way, the basic problem is how to get the
templates accounted for. The basic strategy I'm looking at is this:
start with the basic steady-state tones, with short time windows (e.g.
1/44 second). See how far this can carry you by actually running the
filter on the rest of the archive itself. Whatever outliers are found
are then added to the template set and the filter is remade. The
process is repeated until the whole archive is brought within a given
level of accuracy.

But the real problem is what time scale to use. If you run the time
scale too small, you end up getting a frame in the sense of Kaiser. The
1/44 second seems like a good compromise, but I haven't tested it, and
I don't know if there's anything in the archive that has a smaller
effective time resolution. The snare drums don't seem to fall much
under 1/20 second. In particular, with the compromise I have enough
resolution for the changes in timbre for a given instrument played in a
non-steady style to register as time-dependent changes.

That's the intent for the first stage. For the second stage of the
analysis, the intent is to extract playing style and changes in quality
straight off the transcription itself and assign these as "macros" --
both for transcription and playback -- rather than coding them as
templates.

So, the extraction filter is set up as a least squares fit. Each
template (f_1, f_2, ..., f_m) is set up with all its discrete time
translates (f_ij(t) = f_i(t - t_j), with the t_j in 1/44 second steps,
and the f_i possibly having a 1/22 or 1/44 second window to extract the
steady state as a localized waveform). The linear fit for a signal g(t)
is just g(t) = sum(g^ij f_ij(t)) + dg(t). The fidelity ratio
<dg,dg>/<g,g> is used to test goodness of fit.
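
A minimal sketch of that least squares fit, under assumptions of mine
rather than anything tested on the archive: a toy 8000 Hz sample rate,
two Hann-windowed sine tones standing in for real templates, and a
dense matrix solved with numpy.linalg.lstsq. The names translate_matrix
and fit are hypothetical.

```python
import numpy as np

def translate_matrix(templates, n_samples, hop):
    """Columns are the translates f_ij(t) = f_i(t - t_j), with t_j = j * hop."""
    columns = []
    for f in templates:
        for start in range(0, n_samples - len(f) + 1, hop):
            col = np.zeros(n_samples)
            col[start:start + len(f)] = f
            columns.append(col)
    return np.column_stack(columns)

def fit(signal, templates, hop):
    """Return the coefficients g^ij, the residual dg, and <dg,dg>/<g,g>."""
    A = translate_matrix(templates, len(signal), hop)
    coeffs, *_ = np.linalg.lstsq(A, signal, rcond=None)
    residual = signal - A @ coeffs
    ratio = np.dot(residual, residual) / np.dot(signal, signal)
    return coeffs, residual, ratio

rate = 8000                   # toy sample rate (assumption)
hop = rate // 44              # roughly a 1/44 second step
win = 2 * hop                 # roughly a 1/22 second window
t = np.arange(win) / rate
window = np.hanning(win)
templates = [window * np.sin(2 * np.pi * 440.0 * t),
             window * np.sin(2 * np.pi * 660.0 * t)]

n = 10 * hop
rng = np.random.default_rng(1)

# A signal built from the translates themselves: the fit should be near exact.
A = translate_matrix(templates, n, hop)
inside = A @ rng.standard_normal(A.shape[1])
_, _, ratio_inside = fit(inside, templates, hop)

# White noise is mostly outside the span: the ratio should stay close to 1.
noise = rng.standard_normal(n)
_, _, ratio_noise = fit(noise, templates, hop)
```

A ratio near 0 means the templates account for the signal; a ratio near
1 marks the kind of outlier that would be a candidate for the template
set. A dense matrix like this would not scale to a real archive and is
only meant to show the shape of the computation.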