On Aug 2, 8:34 pm, Mark <federation2...@netzero.com> wrote:
> This is a followup to an article I posted a month or so back on music
> transcription. I've provided enough detail here, along with a key web
> link, to make it possible for any of you interested enough and
> equipped with the right analysis software to try out the process I'm
> about to lay out for yourself.
As mentioned: there was one critical (and ad hoc) assumption made at a
certain point in the analysis which could easily break the whole
scheme. Tests on small sample sets and waveforms yield somewhat
sensible results only when there is a single template ("monophonic
transcription"), but suffer serious interference when multiple
templates are running together ("polyphonic transcription").
In retrospect, the root of the problem (and its remedy) is easy to
see, as I'm about to describe. The remedy is to go back to the
discussion of generalized Hilbert space frames in the original article
(the one the previous article followed up on) and incorporate it
fully, replacing the ad hoc assumption.
The assumption is in the right ballpark. The solution that comes out
of the analysis below is simply: (1) take the correlation with the
templates, (2) apply an equalizer filter that is derived from the
"combined spectrum" of the templates. Step (2) is a deconvolution
problem, which is an unavoidable element that lies at heart of the
transcription problem.
The fix also makes it possible to generalize the analysis and bring in
a limited degree of scale invariance, in addition to time translation
invariance. This is important for being able to model "blue notes" and
"fine tuning".
(1) Related Research
> As best as I can surmise, despite the fact that a dictionary-based
> approach is obvious, the reason for the apparent lack of motion in
> that direction is probably a very simple one: freely-available digital
> sample archives probably didn't even exist until recently.
There has, indeed, been an emphasis in the literature on the bottom-up
approach. It's also the case that a few top-down dictionary lookup
models ("tone modelling") have been discussed and developed in the
literature. A survey was done (as a Master's thesis), "Automatic
Transcription of Music", Anssi Klapuri, Tampere University of
Technology, which made the emphasis on the top-down approach its main
issue. A PDF copy is on the net and can be found by search.
A key reference cited therein: Kashino and Tanaka, "A sound source
separation system with the ability of automatic tone modeling",
Proceedings of the International Computer Music Conference, 1993, and
a followup by Kashino, Nakadai, Kinoshita, and Tanaka, "Application of
Bayesian probability network to music scene analysis", Proceedings of
the International Joint Conference on AI, CASA workshop, 1995. "Tone
modeling" refers to the modeling of musical instruments at different
pitches.
They *do* get limited success, which proves the feasibility of the
problem. But, emphasizing the point made above, this was in the
1990's, before digital archives became widely and freely available.
So, their "tone modelling" is limited to a few instrument models,
whereas today you can do the whole orchestra in all its nuances.
(2) The goal and analysis
To reiterate:
> The goal is ... a reversible sequencer. In the forward direction it uses
> a digital archive, [to sequence] the sound. In the reverse direction, it
> uses the archive as a dictionary to [transcribe the] sequence from the sound.
The problem is to express a waveform f(t) as a sum g_{aq}(t) f^{aq},
ranging over a template index (a) and time displacement (q), with
g_{aq}(t) = g_a(t - q).
The heuristic adopted was to take f^{aq} as the convolution involving
the dual basis g^a: f^{aq} = (g^a * f)(q). That's ad hoc and not
well-grounded.
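As a finite, sampled sketch of this heuristic (the function name, the toy templates, and the signal are all mine, purely for illustration):

```python
import numpy as np

def heuristic_coeffs(f, duals):
    """Heuristic coefficients f^{aq} = (g^a * f)(q): convolve the signal
    with each dual template; rows index templates, columns time shifts."""
    return np.array([np.convolve(f, g, mode="same") for g in duals])

# toy example (duals here are just the templates themselves, for shape only)
t = np.linspace(0.0, 1.0, 64)
g0, g1 = np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 9 * t)
f = np.zeros(256)
f[100:164] += g0                       # one "note" of template 0 at t = 100
coeffs = heuristic_coeffs(f, [g0, g1])
print(coeffs.shape)                    # (2, 256): (template index a, shift q)
```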
(3) Generalized Frames
The remedy is to go back to the original discussion I laid out a month
before the previous article. What you have, with a template set, is a
FRAME
(g_{aq}(t)) as (a,q) range over some index set M.
One wants a dual frame (g^{aq}(t)) of some sort, and ideally one which
shares the same time shift property: g^{aq}(t) = g^a(t - q).
This leads to the idea of Hilbert space frames -- but with the added
twist that we do not require full reproduction of every signal (a
frame normally has complete reproducibility as a condition). That was
as per the original discussion.
The key transform operators here are
T f: converts f(t) to its M-space spectrum
Tf(m) = <g_m, f>
S h: reconstructs a function from the spectrum
Sh(t) = sum_m h(m) g^m(t).
When M is continuous or partly continuous, the sum gets replaced by an
integral. But I'll leave out that detail.
A similar set of operations exists with the frames' roles reversed:
S' f(m) = <g^m, f>
T' h(t) = sum_m h(m) g_m(t).
The operators S' and T' are adjoints respectively of S and T.
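In finite dimensions these four operators are just matrices, and the adjoint relations can be checked directly. A minimal sketch, assuming the frame vectors are the rows of a matrix A and the dual vectors the rows of B (both names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # frame vectors g_m as rows (M=5, n=3)
B = np.linalg.pinv(A).T                  # canonical dual vectors g^m as rows

T  = lambda f: A @ f                     # analysis:  Tf(m)  = <g_m, f>
S  = lambda h: B.T @ h                   # synthesis: Sh     = sum_m h(m) g^m
Sp = lambda f: B @ f                     # S'f(m) = <g^m, f>
Tp = lambda h: A.T @ h                   # T'h    = sum_m h(m) g_m

# adjointness: <S h, f> = <h, S' f> and <T f, h> = <f, T' h>
f, h = rng.standard_normal(3), rng.standard_normal(5)
print(np.allclose(S(h) @ f, h @ Sp(f)))  # True
print(np.allclose(T(f) @ h, f @ Tp(h)))  # True
```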
So, the way a frame is normally defined is that one requires complete
reproducibility. That means the operator Q = S T should be the
identity. So, each function f can be "transcribed" to T f and
"synthesized" to S(Tf) = f.
The spectra Tf do not exhaust the full range of possible functions on
M. So, there is a projection onto the valid spectra. This corresponds
to the condition that P = T S be a projection operator. That amounts
to the same as saying that P = P', or T S = S' T'.
As a consequence, it also follows that S and T are "generalized
inverses" STS =3D S and TST =3D T.
With frames, the usual game played is to start out just with (g_m) and
try to find a dual frame (g^m) that results in the above conditions.
One usually takes the route of defining the "metric" G = T' T and then
noting its relation to the "dual metric" g = S S':
G g = T' T S S' = T' S' T' S' = Q' Q' = I
g G = S S' T' T = S T S T = Q Q = I,
which shows they're inverses. Therefore, a necessary precondition for
frames is that both the metric and its inverse be bounded above and
below by positive values:
0 < a < G < b, 0 < 1/b < g < 1/a.
Then, one can define the dual frame by g^m = g g_m and prove that this
produces operators S and S' that give the right results.
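A finite-dimensional sketch of this construction (names mine): with the frame vectors as rows of a matrix A, the metric G = T'T becomes A^T A, the dual metric is its inverse, and the resulting dual frame g^m = g g_m reproduces every f.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))      # overcomplete frame: 5 vectors in R^3
G = A.T @ A                          # metric      G = T' T
g = np.linalg.inv(G)                 # dual metric g = G^{-1} (needs full rank)
B = A @ g                            # dual frame: row m is g^m = g g_m

# complete reproducibility: S T f = sum_m <g_m, f> g^m recovers f
f = rng.standard_normal(3)
f_rec = B.T @ (A @ f)
print(np.allclose(f, f_rec))         # True
```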
(4) Incomplete Frames
All of the above is what I do NOT want, however! To avoid transcribing
noise, you don't want Q to give you perfect reproducibility, but only
reproduction to a "denoised" signal. More generally, if the template
list is incomplete, you don't want to match to elements that
correspond to instruments outside the template set.
So, the conditions are relaxed by only requiring that:
(a) Q = Q': Q is self-adjoint (i.e. S T = T' S')
(b) P = P': P is self-adjoint (i.e. T S = S' T')
(c) S and T are generalized inverses: STS = S, TST = T.
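In finite dimensions, the Moore-Penrose pseudoinverse yields duals satisfying exactly these relaxed conditions. A sketch with a deliberately incomplete template set (names mine), where Q projects onto the span of the templates instead of reproducing everything:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 6))        # only 2 templates in R^6: incomplete
Q = np.linalg.pinv(A) @ A              # Q = S T:  f -> sum_m <g_m, f> g^m
P = A @ np.linalg.pinv(A)              # P = T S: projection on valid spectra

print(np.allclose(Q, Q.T))             # (a) Q self-adjoint: True
print(np.allclose(P, P.T))             # (b) P self-adjoint: True
print(np.allclose(Q @ Q, Q))           # Q is a projection ... True
print(np.allclose(Q, np.eye(6)))       # ... but NOT the identity: False
```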
When this is applied to a sample set (g_{aq}(t)) with the space M
being a combination of the template set (a in A) and the real line (q
in R), then this actually leads to a definition for the dual frame:
g^{aq}(t) = g^a(t - q)
that is just a convolution with an equalizer filter:
g^a = L * g_a.
The projection Q has the functional form Q(t, t') and, as a
consequence of the conditions above, reduces to the form Q(t - t')
whose frequency components Q(nu) are all 0 or 1. It's a simple filter.
The equalizer filter is directly associated with the "combined
spectrum", which in the frequency domain is:
Gamma(nu) = sum_a |g_a(nu)|^2.
The relation between L, Q and Gamma is just:
L(nu) Gamma(nu) = Q(nu), if Gamma(nu) > 0,
L(nu) = 0, if Gamma(nu) = 0.
So, L is the "deconvolution" filter corresponding to Gamma (which is
why I write it as L -- L is an upside-down Gamma).
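A sketch of building Gamma and L on sampled templates with the FFT, using a small threshold in place of the exact Gamma(nu) > 0 test (function and variable names are mine):

```python
import numpy as np

def equalizer(templates, n, eps=1e-8):
    """Combined spectrum Gamma(nu) = sum_a |g_a(nu)|^2 and the
    deconvolution filter L with L*Gamma = 1 where Gamma > eps, else 0."""
    G = np.stack([np.fft.rfft(g, n) for g in templates])
    gamma = (np.abs(G) ** 2).sum(axis=0)
    L = np.where(gamma > eps, 1.0 / np.maximum(gamma, eps), 0.0)
    return gamma, L

# dual templates g^a = L * g_a, computed in the frequency domain
templates = [np.hanning(64) * np.sin(2 * np.pi * k * np.arange(64) / 64)
             for k in (5, 9)]
n = 256
gamma, L = equalizer(templates, n)
duals = [np.fft.irfft(L * np.fft.rfft(g, n), n) for g in templates]
print(len(duals), duals[0].shape)      # 2 (256,)
```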
This is where all the subtle nuances of different deconvolution
methods can come into play, to get something less drastic than a sharp
cutoff of L(nu) to 0 at Gamma(nu) = 0. The simplest possibility I was
looking at was taking the cutoff for L at Gamma(nu) < threshold. But
one could work in something involving Wiener filters or the like.
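As an illustration of the two options (the threshold cutoff from the text, plus a Wiener/Tikhonov-style smooth reciprocal that is my own stand-in, exploiting the fact that Gamma is real and nonnegative):

```python
import numpy as np

def equalizer_variants(gamma, threshold=1e-6, eps=1e-4):
    """Two regularizations of L = 1/Gamma. The hard cutoff zeroes L
    wherever Gamma < threshold; the Wiener-style version instead rolls
    L off smoothly toward 0 as Gamma -> 0."""
    hard = np.where(gamma > threshold,
                    1.0 / np.maximum(gamma, threshold), 0.0)
    wiener = gamma / (gamma ** 2 + eps)   # ~ 1/Gamma when Gamma >> sqrt(eps)
    return hard, wiener

gamma = np.linspace(0.0, 2.0, 9)          # a toy combined spectrum
hard, wiener = equalizer_variants(gamma)
print(hard[0], wiener[0])                 # 0.0 0.0 -- both vanish at Gamma = 0
```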
(5) Adding in Rescaling
To handle rescaling of pitches, it would also be nice to generalize
each template to a limited set of rescalings: g_{aqp}(t) = sqrt(p)
g_a(p(t - q)), as p ranges over a small band centered on 1. For
semitone resolution, one would take p in the range (2^{-1/12},
2^{1/12}), or even quarter-tone resolution (2^{-1/24}, 2^{1/24}).
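A sketch of generating such a band of rescaled templates by resampling (the linear-interpolation resampler and all names are mine; a real implementation would use band-limited resampling):

```python
import numpy as np

def rescale(g, p):
    """g_{ap}(t) = sqrt(p) * g_a(p t): a pitch shift by factor p,
    realized by resampling, with an L2-normalizing prefactor sqrt(p)."""
    t = np.arange(len(g), dtype=float)
    return np.sqrt(p) * np.interp(p * t, t, g, left=0.0, right=0.0)

g = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
semitone = 2 ** (1 / 12)
band = [rescale(g, p) for p in np.linspace(1 / semitone, semitone, 5)]
print(len(band), band[0].shape)          # 5 (64,)
```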
The results, this time, are that the dual frame g^{aqp}(t) generally
only possesses time invariance, g^{aqp}(t) = g^{ap}(t - q), with the
functions expressible as:
g^{ap} = L_p * g_{ap}.
The deconvolution filter is the reciprocal of the combined spectrum:
L_p Gamma_p = Q,
Gamma_p(nu) = sum_a (integral |g_a(nu/p)|^2 dp/p),
where the integral is taken over the limited range assigned to p, such
as (2^{-1/12}, 2^{1/12}).
The dual g^{ap} may only take on the same rescaling form as g_{ap} if
the interval for p is, itself, scale invariant. That amounts to
allowing p to take on all positive values (or all negative values, or
all real non-zero values). As I mentioned in the previous articles,
this is something you do NOT want, for music transcription. Different
instruments change their quality of sound at different pitches, so you
can't use a single function g_a to serve as a "mother wavelet" for all
the instrument pitches.
(6) Conclusions
Because of the successful results seen in related research, the
problem is already known to be feasible and is not a flat-out
impossibility. So, the next thing I'm gearing up to try is the simple
correlation + deconvolution-equalizer filter approach described in
part (4). I may end up using Wiener deconvolution in place of taking
the flat-out reciprocal.
If there are results worth posting, I'll put them out on YouTube. But,
as mentioned in the previous article, the way is also open for any of
you with the right resources to try this out for yourself. I think
there's enough detail in part (4) to allow you to set up your own
filters. The link to the London Philharmonic's sample library is in
the previous article. The London Philharmonic, apparently, was
involved in the opening ceremonies for the London Olympics a short
while ago.