# Some Notes on Strategies for Automated Music Transcription.

Started by June 13, 2012
```
I'm looking through the net for papers, references, and other writeups
on the topic and am not finding a whole lot -- at least not a lot
that's particularly insightful.

The key feature of the transcription filter is that its result should
be reproducible: it should be able to play back what it transcribes.

What I'm looking for is some way to organize the analysis around a
world class symphony orchestra). The most important thing I've ruled
out, based on a preliminary analysis, is the use of any generic scale-
invariant method. The actual instruments sampled do not transform
covariantly under a change of pitch (i.e. the quality of the sound
changes at different pitches). Consequently, the naive wavelet-based
approach is ruled out, along with anything similar. In addition,
sampling based on a discretization of the time-scale plane (i.e. with
linear time, but frequency on a logarithmic scale) is out.

Instead, I've decided to indiscriminately throw in the different
pitches contained in the archive and treat them as separate templates.
On playback, however, rescaling can be used to derive "blue note"
pitches, or to derive the minor alterations required to tune to
different music scales.

Percussion is handled separately, possibly coding the full duration
rather than a steady state run (after taking into account the
quasi-melodic instruments, like the timpani and triangle, and
separating out the individual beats of instruments like the snare
drum).

A review of the literature brings up the notion of a frame, e.g. in
the sense of Kaiser (the somewhat misnamed "A Friendly Guide to
Wavelets", 1994, Birkhauser), which substantially generalizes
Daubechies (Chapter 3, 1992, "Ten Lectures on Wavelets", Rutgers /
AT&T).

A Hilbert space frame provides a "signal space" H and a "transcription
space" M (which is a measure space, with measure mu). The
"transcription operator" T: H -> L^2(mu) produces a measurable function
for each Hilbert space vector, and a "synthesis operator"
S: L^2(mu) -> H produces a playback. For example, for windowed Fourier
transforms, the function g_qp(x) = g(x - q) exp(2pi i px) yields an
element g_qp of the Hilbert space H = L^2(R) parametrized by (q,p) in
M = R^2, so that L^2(mu) = L^2(R^2).

However, Kaiser's focus is on coding applications, where one wants
complete reproducibility. Consequently, Kaiser has ST = 1_H: H -> H,
while TS: L^2(mu) -> L^2(mu) is a projection operator (ergo TS = S*
T*, where ()* denotes adjoint). The analysis is built around the
property that G = T* T be a strictly positive bounded operator,
bounded away from 0. This is the "frame condition". As is well-known
in the literature, one can have frames satisfying this condition even
with M replaced by a discrete subset of M, as long as the sample
points (q,p) are spaced at or above a certain critical density.

A little thought, though, makes it clear that the frame condition is
what you do NOT want, for a feature extraction application. In effect,
with this kind of filter, if all you did was (to use an extreme case)
code the violins, and if the frame condition held, you'd end up
getting a filter that follows the aphorism "when all you have is a
hammer, then everything starts to look like a nail", and your trumpets
will be played back like violins. That may work, and you may end up
getting superpositions of violins to sound like trumpets, but the
transcription will end up recording the sounds as violins being
played, which misses the whole point of the exercise.

The whole point of "extraction" is that you should leave something
behind; i.e. there should be allowance for the Sounds of Silence
(sounds that leave behind no transcription). So, ST has to have a non-
trivial kernel. Those Sounds of Silence are the "other" sounds that
are not supposed to be the violins.

So, basically, I redid the Kaiser definition by only posing the
properties that (a) S and T are generalized inverses -- STS = S and
TST = T; and (b) ST and TS are both self-adjoint -- ST = T* S* and TS
= S* T*. So, the signal space H decomposes into a part K which has
transcription and a kernel H' which contains all the Sounds of
Silence; while the function space L^2(mu) continues to have (as it did
with Kaiser) the subspace F of transcriptions. The orthogonal
complement of F (as with Kaiser's frames) is the set of
"transcriptions" that play back as silence: they're the kernel of the
S operator and the projection operator P = TS. So, what I'm doing is
replacing Kaiser's resolution of identity ST = I with the projection
operator Q = ST for the K-subspace.

So, now with that out of the way, the basic problem is how to get the
templates accounted for. The basic strategy I'm looking at is this:
1/44 second). See how far this can carry you by actually running the
filter on the rest of the archive itself. Whatever outliers are found
are then added to the template set and the filter is remade. The
process is repeated until the whole archive is brought within a given
level of accuracy.

But the real problem is what time scale to use. If you run the time
scale too small, you end up getting a frame in the sense of Kaiser.
The 1/44 second seems like a good compromise, but I haven't tested it,
and I don't know if there's anything in the archive that has a smaller
effective time resolution. The snare drums don't seem to fall much
under 1/20 second. In particular, with the compromise I have enough
resolution for the changes in timbre for a given instrument played in
a non-steady style to register as time-dependent changes. That's the
intent, for the first stage. For the second stage of the analysis, the
intent is to extract playing style and changes in quality straight off
the transcription itself and assign these as "macros" -- both for
transcription and playback -- rather than coding them as templates.

So, the extraction filter is set up as a least squares fit. Each
template (f_1, f_2, ..., f_m) is set up with all its discrete time
translates: f_ij(t) = f_i(t - t_j), with the t_j in 1/44 second steps
and each f_i possibly having a 1/22 or 1/44 second window to extract
out the steady state as a localized waveform. The linear fit for a
signal g(t) is just g(t) = sum_ij(g^ij f_ij(t)) + dg(t), where dg is
the residual. The fidelity ratio <dg,dg>/<g,g> is used to test
goodness of fit.
```
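
The modified frame definition (S and T generalized inverses of each other, with ST and TS self-adjoint) is what the Moore-Penrose pseudoinverse gives in finite dimensions. The following is only a toy sketch under that assumption: the matrix sizes, the rank, and the random T are all hypothetical, and real matrices stand in for operators so that the adjoint is just the transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
n_signal, n_trans, rank = 6, 8, 3   # hypothetical dimensions of H, L^2(mu), K

# Rank-deficient T: only a 3-dimensional subspace K of H gets transcribed.
T = rng.standard_normal((n_trans, rank)) @ rng.standard_normal((rank, n_signal))

# Moore-Penrose pseudoinverse as the synthesis operator S.
# An explicit cutoff keeps numerically-zero singular values out of the inverse.
S = np.linalg.pinv(T, rcond=1e-10)

Q = S @ T   # acts on H: projection onto K (replaces Kaiser's ST = I)
P = T @ S   # acts on the transcription space: projection onto F

checks = {
    "STS = S": np.allclose(S @ T @ S, S),
    "TST = T": np.allclose(T @ S @ T, T),
    "ST self-adjoint": np.allclose(Q, Q.T),
    "TS self-adjoint": np.allclose(P, P.T),
    "Q is a projection": np.allclose(Q @ Q, Q),
    "P is a projection": np.allclose(P @ P, P),
}

# A "Sound of Silence": the part of a signal lying in the kernel H'.
x = rng.standard_normal(n_signal)
silent = x - Q @ x
silence_has_no_transcription = np.allclose(T @ silent, 0.0)

for name, ok in checks.items():
    print(f"{name}: {ok}")
print("T(silent) = 0:", silence_has_no_transcription)
```

In this toy setting the trace of Q equals the dimension of K, which gives a quick way to see how much of the signal space the templates actually cover.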
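
The least squares fit over time-translated templates can be sketched as below. This is a minimal illustration, not the filter itself: the 44.1 kHz sample rate, the Hann window, the two sine "templates", and the helper names `translated_templates` and `fit_signal` are all assumptions made for the example.

```python
import numpy as np

def translated_templates(templates, n_samples, hop):
    """Place every template f_i at every hop-spaced offset t_j; one column per f_ij."""
    columns = []
    for f in templates:
        for start in range(0, n_samples - len(f) + 1, hop):
            col = np.zeros(n_samples)
            col[start:start + len(f)] = f
            columns.append(col)
    return np.column_stack(columns)

def fit_signal(g, A):
    """Least squares fit g = A @ coeffs + dg; returns coeffs, dg and <dg,dg>/<g,g>."""
    coeffs, *_ = np.linalg.lstsq(A, g, rcond=None)
    dg = g - A @ coeffs
    ratio = float(dg @ dg) / float(g @ g)
    return coeffs, dg, ratio

# Hypothetical setup: 44.1 kHz audio, ~1/44 s steps, ~1/22 s windows.
rate = 44100
hop = rate // 44
win = rate // 22
t = np.arange(win) / rate
window = np.hanning(win)
templates = [window * np.sin(2 * np.pi * 440.0 * t),
             window * np.sin(2 * np.pi * 660.0 * t)]

n_samples = 4 * hop + win
A = translated_templates(templates, n_samples, hop)

# A signal built only from templates should leave almost no residual.
true_coeffs = np.zeros(A.shape[1])
true_coeffs[1] = 0.8
true_coeffs[7] = 0.5
g_in = A @ true_coeffs
_, _, ratio_in = fit_signal(g_in, A)

# An unrelated sound (here white noise) should be left mostly untranscribed.
rng = np.random.default_rng(0)
g_out = rng.standard_normal(n_samples)
_, _, ratio_out = fit_signal(g_out, A)

print("ratio for template-built signal:", ratio_in)
print("ratio for unrelated noise:", ratio_out)
```

A ratio near 0 means the templates account for the signal; a ratio near 1 means the signal falls almost entirely among the "Sounds of Silence". For a real archive the dense matrix A would get large quickly, so a sparse or convolution-based formulation would likely be needed.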