comp.dsp | DSP question - Bandwidth Extension

Here's what I have:

1. Acoustic model (CMU Sphinx) to be used in a keyword spotter. Trained for
speech sampled at 16kHz and performs well. Doesn't perform well when
presented with a speech signal sampled at 8kHz or a speech signal with max
bandwidth of 4kHz and sample rate = 16kHz.

2. A microphone which only delivers a narrow-band signal. The bandwidth of
the signal is max 4kHz. I can set the sample rate (audio driver API) to
16kHz, but the bandwidth remains the same since the underlying
HW samples at 8kHz. Can't change that!

Here's the result:

The keyword spotter fails when it's presented with a speech signal (sample
rate 16kHz) which only has a bandwidth of 4kHz.

Here's my question:
Would it be reasonable to expect that the keyword spotter will work if I
"fake it" by bandwidth extending the narrowband signal prior to sending it
to the keyword spotter?

What is the simplest BW-extender? (I'm looking for something which can be
implemented fast).

Thanks

Reply by Steve Pope ●July 15, 2016

Mauritz Jameson  <mjames2393@gmail.com> wrote:

>Here's what I have:

>1. Acoustic model (CMU Sphinx) to be used in a keyword spotter. Trained
>for speech sampled at 16kHz and performs well. Doesn't perform well when
>presented with a speech signal sampled at 8kHz or a speech signal with
>max bandwidth of 4kHz and sample rate = 16kHz.

Can it be re-trained with the lower-bandwidth speech?  There should
not be much keyword information above 4 KHz, for most languages
including English.  There are no formants up that high, or if there
are, they have little semantic meaning.

>Here's my question:
>Would it be reasonable to expect that the keyword spotter will work if I
>"fake it" by bandwidth 
>extending the narrowband signal prior to sending it to the keyword spotter?

>What is the simplest BW-extender ? (I'm looking for something which can
>be implemented fast).

Seems like a shot in the dark to me.  You might try an envelope
follower, driven by high-pass filtered noise about 4 KHz, and 
controlled by the source signal in the 3 to 4 KHz range.

Re-training is cleaner.

Good luck...

Steve

Reply by Mauritz Jameson ●July 15, 2016

> Can it be re-trained with the lower-bandwidth speech?

That is the obvious and correct solution, but the answer in this particular
case (even though it would be possible given time and money) is no.

>You might try an envelope follower, driven by high-pass filtered noise
>about 4 KHz, and controlled by the source signal in the 3 to 4 KHz range.

Is there any literature on that kind of approach? Or else I'm not entirely
sure I understand what you mean. Are you saying:

1. Resample from 8kHz to 16kHz
2. Estimate spectrum envelope from 4kHz to 8kHz
3. Create a high pass filter (cut-off around 4kHz)
4. ????

What is driving the filter (to get an output)? "Source signal" as defined
in LPC analysis? Or "source signal" as in the speech signal? Basically, I
need to know the best/fastest way to estimate what the upper band
(4kHz - 8kHz) should look like given a narrow-band speech signal.


> Re-training is cleaner.

I totally agree!

Reply by Steve Pope ●July 15, 2016

Mauritz Jameson  <mjames2393@gmail.com> wrote:

>> Can it be re-trained with the lower-bandwidth speech?  

>That is the obvious and correct solution, but the answer in this
>particular case (even though it would be possible given time and money)
>is no.

Right.

>>You might try an envelope follower, driven by high-pass filtered noise
>>about 4 KHz, and controlled by the source signal in the 3 to 4 KHz
>>range.

>Is there any literature on that kind of approach? 

No

> Or else I'm not entirely sure I understand what you mean. Are you saying:

>1. Resample from 8kHz to 16kHz
>2. Estimate spectrum envelope from 4kHz to 8kHz
>3. Create a high pass filter (cut-off around 4kHz)
>4. ????

1. Resample from 8kHz to 16kHz, with suitable anti-alias filtering
so that there is no power from 4KHz to 8KHz

2. Estimate the envelope of the power from 3 KHz - 4 KHz
3. Create a noise like signal with the envelope from (2) and
with power from 4 KHz to 8 KHz; add this signal to the original.
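
The three steps above can be sketched roughly as follows. This is only one reading of the recipe, not a tested bandwidth extender: the helper names, the 63-tap windowed-sinc interpolation filter, the 10 ms RMS window, the FFT-mask band filters and the `gain` value are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

FS_OUT = 16000  # output rate in Hz; the input is assumed to be sampled at 8 kHz


def upsample_2x(x, taps=63):
    """Step 1: zero-stuff to 16 kHz, then low-pass at 4 kHz to remove the image."""
    y = np.zeros(2 * len(x))
    y[::2] = x
    n = np.arange(taps) - (taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(taps)  # windowed sinc, cutoff fs/4
    return 2.0 * np.convolve(y, h, mode="same")    # gain of 2 restores the level


def band_limit(x, lo, hi, fs=FS_OUT):
    """Keep only lo..hi Hz with an FFT mask (adequate for an offline sketch)."""
    spec = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    spec[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))


def envelope(x, win_ms=10.0, fs=FS_OUT):
    """Step 2: crude envelope follower, a moving RMS over win_ms milliseconds."""
    n = max(1, int(fs * win_ms / 1000.0))
    power = np.convolve(x * x, np.ones(n) / n, mode="same")
    return np.sqrt(np.maximum(power, 0.0))


def extend_bandwidth(x8k, gain=0.5, win_ms=10.0, seed=0):
    """Step 3: add 4-8 kHz noise shaped by the envelope of the 3-4 kHz band."""
    wide = upsample_2x(np.asarray(x8k, dtype=float))
    env = envelope(band_limit(wide, 3000.0, 4000.0), win_ms)
    rng = np.random.default_rng(seed)
    noise = band_limit(rng.standard_normal(len(wide)), 4000.0, 8000.0)
    noise /= np.sqrt(np.mean(noise ** 2))          # normalise to unit RMS
    return wide + gain * env * noise
```

Calling `extend_bandwidth(x8k)` on an 8 kHz clip returns a 16 kHz signal of twice the length. The `gain` and `win_ms` values are guesses and would have to be tuned against the keyword spotter itself.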

>> Re-training is cleaner.
>
>I totally agree!

S.

Reply by ●July 15, 2016

On Thursday, July 14, 2016 at 11:46:19 PM UTC-7, Mauritz Jameson wrote:
> > Can it be re-trained with the lower-bandwidth speech?  

> That is the obvious and correct solution, but the answer in this particular case
>  (even though it would be possible given time and money) is no.

I think you need to explain the problem in more detail.  

Why exactly can't it be retrained?   It matters.

It might be that you can make some assumptions on harmonics that go
above 4kHz given a voice signal below.  Those assumptions depend in 
unobvious ways on the previous assumptions.

Overfitting is a common problem in a variety of machine learning systems.

Normally, overfitting is bad, in that the resultant system doesn't perform
properly, but it sounds like yours does given 8kHz bandwidth input.

But even so, it would seem that it trained on those, when we know
it shouldn't have.  (Note that 8kHz sampling rate is common for telephone
systems, so if you expect to work with telephone signals, that is what
you will get.)

It might be (just guessing) that every frequency from 2kHz to 4kHz has a
second harmonic of lower amplitude above 4kHz.  Many systems only
have odd harmonics, so maybe it is from 1.33kHz to 2.66kHz third
harmonics.  If your system trained on those, it will fail without them,
even if all voices are consistent in those harmonics.

Or maybe you can go back and figure out what the trained system
would have trained on given 4kHz input.  This depends on knowing more
details than I can think of, though.

> > You might try an envelope follower, driven by high-pass filtered noise
> > about 4 KHz, and controlled by the source signal in the 3 to 4 KHz range.

> Is there any literature on that kind of approach? 
> Or else I'm not entirely sure I understand what you mean. Are you saying:

(snip)

There is a story about a neural network system that was trained, and
did very well, at predicting the credit worthiness of borrowers. 
It turns out, though, that according to the law when you deny credit
you have to explain why.  "My neural network said so" isn't good enough
in the legal sense.

If your system is neural net based, I suspect it will be hard to figure out.

-- glen

Reply by ●July 15, 2016

On Friday, July 15, 2016 at 1:01:00 PM UTC-7, Steve Pope wrote:
> Mauritz Jameson  <mjames...@gmail.com> wrote:

(snip)
> >That is the obvious and correct solution, but the answer in this
> >particular case (even though it would be possible given time and money)
> >is no.

(snip)

> 1. Resample from 8kHz to 16kHz, with suitable anti-alias filtering
> so that there is no power from 4KHz to 8KHz

> 2. Estimate the envelope of the power from 3 KHz - 4 KHz
> 3. Create a noise like signal with the envelope from (2) and
> with power from 4 KHz to 8 KHz; add this signal to the original.

4. Using known 4kHz input signals, and the above system, test it out.

5. Adjust envelope parameters in some way.

6. Go to step 3, until an optimal result is obtained.
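
The test-and-adjust loop in steps 4 to 6 could be as simple as a grid search. The sketch below is hypothetical throughout: `extend_fn` stands for whatever bandwidth extender is being tried, `score_fn` stands for some measure of how well the keyword spotter does on the extended clip (higher is better), and the parameter names `gain` and `win_ms` are placeholders for the envelope parameters being adjusted.

```python
from itertools import product


def tune(extend_fn, score_fn, clips, gains, windows_ms):
    """Try every (gain, window) pair and return (best_score, gain, win_ms).

    extend_fn(clip, gain=..., win_ms=...) -> extended signal   (hypothetical)
    score_fn(signal) -> number, higher is better               (hypothetical)
    """
    best = None
    for gain, win_ms in product(gains, windows_ms):
        # Step 4: run the known narrow-band clips through the extender and score them.
        total = sum(score_fn(extend_fn(clip, gain=gain, win_ms=win_ms))
                    for clip in clips)
        # Steps 5-6: keep the best setting seen so far, then try the next one.
        if best is None or total > best[0]:
            best = (total, gain, win_ms)
    return best
```

In practice `score_fn` would wrap the actual keyword spotter, for example by returning 1 for a correct detection and 0 otherwise on labelled test clips.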

I am guessing that second and third harmonics of the input are
the main contributors, but that is really a guess.

Reply by ●July 16, 2016

The fundamental frequency of an adult's voice is in the low hundreds of Hz,
so it's probably not the harmonics of voiced sounds that are the issue.

Sounds like S, SH, F, TH, as well as shorter consonants, will have lots of
noisy content above 4kHz, and how much will depend on the phoneme. If you
really want to give the model what it's expecting you'd have to figure out
what sound it's supposed to be and resynthesize appropriately.

As far as cheap attempts at a solution go, you could take your 8kHz signal
and resample to 16kHz by inserting zeroes. This will just reflect all of
the energy around 4kHz, and will sound awful, but who knows, maybe the model
will like it better than a correct resampling. You could also refine this
approach: maybe leave the content below 2kHz alone, but duplicate the 2-4kHz
band into one or both of the 4-6 and 6-8kHz bands.
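
Both ideas are short to try. The sketch below assumes NumPy; `zero_stuff_2x` is the plain zero insertion, which mirrors a component at f Hz to 8000 - f Hz at the new 16 kHz rate. `copy_band` is one possible reading of "duplicate the band": it shifts a copy of the band upward in the FFT domain rather than mirroring it, and it expects a signal that is already at 16 kHz. The function names and the block-FFT approach are assumptions for illustration.

```python
import numpy as np


def zero_stuff_2x(x8k):
    """Double the sample rate by inserting a zero after every sample (no filter)."""
    y = np.zeros(2 * len(x8k))
    y[::2] = x8k
    return y


def copy_band(x, fs, src_lo, src_hi, dst_lo):
    """Add a frequency-shifted copy of the src_lo..src_hi Hz band starting at dst_lo Hz."""
    n = len(x)
    spec = np.fft.rfft(x)
    lo = int(round(src_lo * n / fs))
    hi = int(round(src_hi * n / fs))
    shift = int(round((dst_lo - src_lo) * n / fs))
    if hi + shift > len(spec):
        raise ValueError("destination band would exceed the Nyquist frequency")
    out = spec.copy()
    out[lo + shift:hi + shift] += spec[lo:hi]   # everything else is left untouched
    return np.fft.irfft(out, n=n)


if __name__ == "__main__":
    t = np.arange(8000) / 8000.0
    tone = np.sin(2 * np.pi * 1000.0 * t)       # 1 kHz tone sampled at 8 kHz
    mag = np.abs(np.fft.rfft(zero_stuff_2x(tone)))
    # With 16000 output samples the bins are 1 Hz apart, so bin index equals Hz.
    print(sorted(np.argsort(mag)[-2:].tolist()))
```

For the refinement, `copy_band(x16k, 16000, 2000, 4000, 4000)` fills 4-6 kHz, and a second call with `dst_lo=6000` fills 6-8 kHz.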

Good luck!

-Ethan

Reply by Steve Pope ●July 16, 2016

<ethan@polyspectral.com> wrote:

>The fundamental frequency of an adult's voice is in the low hundreds of
>Hz, so it's probably not the harmonics of voiced sounds that are the
>issue.
>
>Sounds like S, SH, F, TH, as well as shorter consonants, will have lots
>of noisy content above 4kHz, and how much will depend on the phoneme. If
>you really want to give the model what it's expecting you'd have to
>figure out what sound it's supposed to be and resynthesize
>appropriately.

Agree

>As far as cheap attempts at a solution go, you could take your 8kHz
>signal and resample to 16kHz by inserting zeroes. This will just reflect
>all of the energy around 4kHz, and will sound awful, but who knows,
>maybe the model will like it better than a correct resampling. You could
>also refine this approach: maybe leave the content below 2kHz alone, but
>duplicate the 2-4kHz band into one or both of the 4-6 and 6-8kHz bands.

Yes, the idea is to goose the model with something above 4 KHz,
whether or not it makes sense as synthetic speech.

Steve

Reply by ●July 18, 2016

On Friday, July 15, 2016 at 2:57:55 PM UTC+12, Mauritz Jameson wrote:
> Here's what I have:
>
> 1. Acoustic model (CMU Sphinx) to be used in a keyword spotter. Trained
> for speech sampled at 16kHz and performs well. Doesn't perform well when
> presented with a speech signal sampled at 8kHz or a speech signal with max
> bandwidth of 4kHz and sample rate = 16kHz.
>
> 2. A microphone which only delivers a narrow-band signal. The bandwidth of
> the signal is max 4kHz. I can set the sample rate (audio driver API) to
> 16kHz, but the bandwidth remains the same since the underlying
> HW samples at 8kHz. Can't change that!
>
> Here's the result:
>
> The keyword spotter fails when it's presented with a speech signal (sample
> rate 16kHz) which only has a bandwidth of 4kHz.
>
> Here's my question:
> Would it be reasonable to expect that the keyword spotter will work if I
> "fake it" by bandwidth extending the narrowband signal prior to sending it
> to the keyword spotter?
>
> What is the simplest BW-extender? (I'm looking for something which can be
> implemented fast).
>
> Thanks

Get the FEDs to do their own dirty work.
