More Notes on Automated Music Transcription & Reversible Sequencers

Started by Mark August 2, 2012
This is a followup to an article I posted a month or so back on music
transcription. I've provided enough detail here, along with a key web
link, to make it possible for any of you interested enough and
equipped with the right analysis software to try out the process I'm
about the lay out for yourself.

As I mentioned there, I've decided to adopt an approach to doing
dictionary-based transcription using templates derived from a digital
sample archive (namely: the only distributed by the London
Philharmonic Orchestra, plus a few add-ons to compensate for
instruments not present in their data set).

(1) Some background
The field of automated music transcription, technically considered a
species of Artificial Intelligence, does not seem to have a lot of
research literature out on it. What actually IS present is apparently
focused on a much harder problem that I intend to tackle:
transcription from zero-knowledge. My main criticism of that approach
is that it is much like sticking a 7 year old English speaker in a
German immersion program and expecting that he'll magically come out
fluent in the language (and I can tell you, first-hand, that that is
not going to work and is going to crash into a serious wall!) So,
naturally enough, much of the focus in the literature has been on the
real rock-bottom nuts and bolts primitives of simply distinguishing a
pitch or a combination of pitches.

As best as I can surmise, despite the fact that a dictionary-based
approach is obvious, the reason for the apparent lack of motion in
that direction is probably a very simple one: freely-available digital
sample archives probably didn't even exist until recently. There may
also be a research bias: if you're insistent on calling music
transcription "AI", then using a dictionary-based method may be
considered cheating.

I don't consider transcription to be AI -- at least not until you
start messing around with automated generation of instruments
themselves, and with automated parsing and processing of actual
musical language. And neither of those are what we're doing here.

(2) The goal
The goal is simple: to construct a reversible sequencer. In the
forward direction it uses a digital archive, in conjunction with a
sequencer specification, to produce the sound. In the reverse
direction, it uses the archive as a dictionary to do template-matching
to extract out a sequencer specification from the sound.

Personally, the reason to have the 2-in-1 package is this: when using
it in the reverse direction, it affords a better opportunity to use
the computer as an aid in helping to study how orchestrations are put
together. The best way to gain proficiency in the craft is directly:
analyze the actual compositions. In the forward direction it allows
one to tinker with already-existing compositions and to put together
ones' own. That, of course, is the second element of proficiency:
study and then practice.

(3) The mathematical definability of the problem
The mathematical problem of digital extraction is ambiguous. As a
consequence, it's necessary to make a certain critical assumption at a
key point. That assumption may foil the whole process. At the moment
that's a matter that remains to be determined.

In addition, there is a point where the problem reduces to a
generalized form of the ill-defined problem of reverse-convolution. In
a way, this is good. If the reverse convolution part were well-
defined, this would enable one (for instance) to devise a complete
frame comprising a small set of templates (like violins and trumpets).
Completeness, however, means that everything you pipe into the reverse
sequencer would be rendered as violins and trumpets -- even things
that are not.

Therefore, at some point, you want the analysis to have some form of
"incompleteness": it should not be possible to fully reproduce the
sounds from a given source from the set of templates you use. At bare
minimum, even apart from the issue of avoiding turning everything into
violins and trumpets, you want this as a means for noise-removal.
Noise should not be rendered as sequential compositions of actual

(4) The analysis
The basic analysis is simple ... IF you assume a linear model. Let
(g_a: a in A) be the set of templates. You want the analysis to be
time-invariant, so all functions g_{aq}(t) = g_a(t - q) are included
in that set. The decomposition for a given signal f(t) desired is then
of the form:
   f(t) = sum_{q,a} g_{aq}(t) f^{aq}
The sum over q is continuous, so in fact this reduces to a convolution
problem. Rewrite the coefficients as f^{aq} = f^a(q), then you have
   f(t) = sum_a integral g_a(t - q) f^a(q) dq = sum_a (g_a * f^a)(t).

When stated in this form, the first issue becomes plainly obvious: the
problem is ambiguous. The solutions are not unique.

So the assumption made is that the solution f^a is parallel to the
dual (g^a: a in A) of the frame (g_a: a in A) and can be written as
f^a = k * g^a.

That assumption may spoil the whole problem. As of yet, I don't know
and it's a matter I'm about to put to the test.

The dual is constructed as follows. Associated with the frame is the
   g_{ab} = <g_a, g_b> = integral g_a(t)* g_b(t) dt
where g_a(t)* denotes the complex conjugate of g_a(t). The inverse
metric g^{ab} is defined by the conditions
   sum_b g^{ab} g_{bc} = delta^a_c
where delta^a_c is the Kronecker delta (= 1 if a = c, and = 0

The dual frame is then g^a = sum_b g^{ab} g_b.

If you substitute the assumed form of the solution in the actual
expansion, you get:
   f = sum_a (g_a * g^a * k).
In the frequency domain, this becomes
   f(nu) = (sum_a g_a(nu) g^a(nu)) k(nu).
This is where the second main issue previously mentioned crops up:
incompleteness. The key element here is the combined spectrum of the
   G(nu) = sum_a g_a(nu) g^a(nu).
This need not be everywhere non-zero. Even when close to zero, it's
going to produce problems.

Therefore to arrive at a solution requires that one compromises and
adds in a filter L(t). Then the reconstruction will no longer be for
the function f(t), but its filtered version (f * L)(t).

The filter's frequency spectrum L(nu) should be zero at all the
problem points for G(nu). So, under this condition, one can take the
"filtered reciprocal" of G(nu), which could be denoted L(nu)/G(nu). In
other words, this is the deconvolution problem in play here.

Then, the solution for k(t) is just (in the frequency domain) k(nu) =
f(nu) L(nu)/G(nu) and the coefficient functions reduce to:
   f^a(nu) = f(nu) L(nu)/G(nu) g^a(nu) = f(nu) G^a(nu)
where the "extraction filter" G^a(nu) is defined by: G^a(nu) = g^a(nu)

And that's the solution we're looking for.

(5) Conclusions
The analysis is mostly simple linear algebra up to the point where we
need to find G(nu). This requires convolution and/or Fourier
transform. The definition of L(nu) reduces to a kludge -- but one
whose proper formulation will also be critical for the problem: among
other things, L(nu) has to go to 0 faster than G(nu) does, wherever
G(nu) has zeros.

Once it's in place, the filtered reciprocal and construction of the
extraction function can proceed entirely in the frequency domain
without reverting back to the time domain.

The London Philharmonic's sample archive may be found at

So, the way is also clear for any of you who have the software tools
to do the analysis to try out the process for yourself. Selecting a
template set from the archive, however, is a non-trivial part of the
process. Instruments do not scale by pitch (hence limiting the
usability of time-scale analysis and "standard" pitch templates).
Playing styles can change timbre in mid-stream. In addition, you would
have to index (or reindex, as the case may be) whatever part of the
archive you use.
On Aug 2, 8:34=A0pm, Mark <> wrote:
> This is a followup to an article I posted a month or so back on music > transcription. I've provided enough detail here, along with a key web > link, to make it possible for any of you interested enough and > equipped with the right analysis software to try out the process I'm > about the lay out for yourself.
As mentioned: there was one critical (and ad hoc) assumption made at a certain point in the analysis which could easily break the whole scheme. Tests on small sample sets and waveforms only yield somewhat sensible results where there is only one template ("monophonic transcription"), but get serious interference when multiple templates are running together ("polyphonic transcription"). In retrospect, the root of the problem (and a remedy) are easy to see, as I'm about to describe. The remedy involves going back to the discussion (involving the generalized Hilbert space frames) in the original article that the previous one was a followup to and fully incorporating that to replace the ad hoc assumption. The assumption is in the right ballpark. The solution that comes out of the analysis below is simply: (1) take the correlation with the templates, (2) apply an equalizer filter that is derived from the "combined spectrum" of the templates. Step (2) is a deconvolution problem, which is an unavoidable element that lies at heart of the transcription problem. The fix also makes it possible to generalize the analysis and bring in a limited degree of scale invariance, in addition to time translation invariance. This is important for being able to model "blue notes" and "fine tuning". (1) Related Research
> As best as I can surmise, despite the fact that a dictionary-based > approach is obvious, the reason for the apparent lack of motion in > that direction is probably a very simple one: freely-available digital > sample archives probably didn't even exist until recently.
There has, indeed, been an emphasis in the literature on the bottom-up approach. It's also the case that a few top-down dictionary lookup models ("tone modelling") have been discussed and developed in the literature. A survey was done (as a Master=92s Thesis), =93Automatic Transcription of Music=94, Anssi Klapuri, Tampere University of Technology, which made its main issue the emphasis on the top-down approach. A PDF copy is on the net and can be found by search. A key reference cited therein: Tanaka Kashino. =93A sound source separation system with the ability of automatic tone modeling=94, Proceedings of the International Computer Music Conference, 1993 and a followup by Nakadai Kashino, Tanaka Kinoshita, =93Application of Bayesian probability network to music scene analysis=94. Proceedings of the International Joint Conference on AI, CASA workshop, 1995. =93Tone modeling=94 refers to modeling of musical instruments at different pitches. They *do* get limited success, which proves the feasibility of the problem. But, emphasizing the point made above, this was in the 1990's, before digital archives became widely and freely available. So, their "tone modelling" is limited to a few instrument models, whereas today you can do the whole orchestra in all its nuances. (2) The goal and analysis To reiterate:
> The goal is ... a reversible sequencer. In the forward direction it uses > a digital archive, [to sequence] the sound. In the reverse direction, it > uses the archive as a dictionary to [transcribe the] sequence from the so=
und. The problem is to express a wave form f(t) as a sum g_{aq}(t) f^{aq}, ranging over a template index (a) and time displacement (q), with g_{aq}(t) =3D g_a(t - q). The heuristic adopted was to take f^{aq} as the convolution involving the dual basis g^a: f^{aq} =3D g^a * f (q). That's ad hoc and not well- grounded. (3) Generalized Frames The remedy is to go back to the original discussion I laid out a month before the previous article. What you have, with a template set, is a FRAME (g_{aq}(t)) as (a,q) range over some index set M. One wants a dual frame (g^{aq}(t)) of some sort, and ideally one which shares the same time shift property: g^{aq}(t) =3D g^a(t - q). This leads to the idea of Hilbert space frames -- but with the added twist that we do not require full reproduction of every signal (a frame normally has complete reproducibility as a condition). That was as per the original discussion. The key transform operators here are T f: converts f(t) to its M-space spectrum Tf(m) =3D <g_m, f> S h: reconstructs a function from the spectrum Sh(t) =3D sum_m h(m) g^m(t). When M is continuous or partly continuous, the sum gets replaced by an integral. But I'll leave out that detail. A similar set of operations exist with the frames' roles reversed: S' f(m) =3D <g^m, f> T' h(t) =3D sum_m h(m) g_m(t). The operators S' and T' are adjoints respectively of S and T. So, the way a frame is normally defined is that one requires complete reproducibility. That means the operator Q =3D S T should be the identity. So, each function f can be "transcribed" to T f and "synthesized" to S(Tf) =3D f. The spectra Tf do not exhaust the full range of possible functions on M. So, there is a projection onto the valid spectra. This corresponds to the condition that P =3D T S be a projection operator. That amount to the same as saying that P =3D P', or T S =3D S' T'. As a consequence, it also follows that S and T are "generalized inverses" STS =3D S and TST =3D T. With frames, the usual game played is to start out just with (g_m) and try to find a dual frame (g^m) that results in the above conditions. One usually takes the route of defining the "metric" G =3D T' T and then noting its relation to the "dual metric" g =3DS S': G g =3D T' T S S' =3D T' S' T' S' =3D Q' Q' =3D I g G =3D S S' T' T =3D S T S T =3D Q Q =3D I, which shows they're inverse. Therefore, a necessary precondition for frames is that both the metric and its inverse be bounded above and below by positive values: 0 < a < G < b, 0 < 1/b < g < 1/a. Then, one can define the dual frame by g^m =3D g g_m and prove that this produces operators S and S' that give the right results. (4) Incomplete Frames All of the above is what I do NOT want, however! To avoid transcribing noise, you don't want Q to give you perfect reproducibility, but only reproduction to a "denoised" signal. More generally, if the template list is incomplete, you don't want to match to elements that correspond to instruments outside the template set. So, the conditions are relaxed by only requiring that: (a) Q =3D Q': Q is self-adjoint (i.e. S T =3D T' S') (b) P =3D P': P is self-adjoint (i.e. T S =3D S' T') (c) S and T are generalized inverses: STS =3D S, TST =3D T. When this is applied to a sample set (g_{aq}(t)) with the space M being a combination of the template set (a in A), and real line (q in R), then this actually leads to a definition for the dual frame: g^{aq}(t) =3D g^a(t - q) that is just a convolution with an equalizer filter: g^a =3D L * g_a. The projection Q has the functional form Q(t, t') and, as a consequence of the conditions above, reduces to the form Q(t - t') whose frequency components Q(nu) are all 0 or 1. It's a simple filter. The equalizer filter is directly associated with the "combined spectrum", which in the frequency domain is: Gamma(nu) =3D sum_a |g_a(nu)|^2. The relation between L, Q and Gamma is just: L(nu) Gamma(nu) =3D Q(nu), if Gamma(nu) > 0. L(nu) =3D 0 if Gamma(nu) =3D 0. So, L is the "deconvolution" filter corresponding to Gamma (which is why I write it as L -- L is an upsidedown Gamma). This is where all the subtle nuances with difference deconvolution methods can come into play to get something less drastic than a sharp cutoff for L(nu) to 0 at Gamma(nu) =3D 0. The simplest possibility I was looking at was taking the cutoff for L at Gamma(nu) < threshold. But one could work in something involving Weiner filters or the like. (5) Adding in Rescaling To handle rescaling of pitches, it would also be nice to generalize each template to a limited set of rescalings: g_{aqp}(t) =3D root(p) g(p(t - q)), as p ranges over a small band centered on 1. For half- note resolution, one would take p in the range (2^{-1/12}, 2^{1/12}} or even at quarter notes (2^{-1/24}, 2^{1/24}). The results, this time, are that the dual frame g^{aqp}(t) generally only possess time invariance g^{aqp}(t) =3D g^{ap}(t - q), with the functions expressible as: g^{ap} =3D L_p * g_{ap}. The deconvolution filter is the reciprocal of the combined spectrum L_p Gamma_p =3D Q Gamma_p(nu) =3D sum_a (integral |g_a(nu/p)|^2 dp/p) where the integral is taken over the limited range assigned to p, such as (2^{-1/12}, 2^{1/12}). The dual g^{ap} may only take on the same rescaling form as g_{ap} if the interval for p is, itself, scale invariant. That amount to allowing p to take on all positive values (or all negative values, or all real non-zero values). As I mentioned in the previous articles, this is something you do NOT want, for music transcription. Different instruments change their quality of sound at different pitches, so you can't use a single function g_a to serve as a "mother wavelet" for all the instrument pitches. (6) Conclusions Because of the successful results seen in related research, the problem is already known to be feasible and is not a flat-out impossibility. So, the next thing I'm gearing up to try is the simple correlation + deconvolution-equalizer filter approach described in part (4). I may end up using Weiner deconvolution in place of taking the flat-out reciprocal. If there are results worth posting, I'll put them out on YouTube. But, as mentioned in the previous article, the way is also open for any of you with the right resources to try this out for yourself. I think there's enough detail in part (4) to allow you to set up your own filters. The link to the London Philharmonic's sample library is in the previous article. The London Philharmonic, apparently, was involved in the opening ceremonies for the London Olympics a short while ago.