I am designing the echo-cancelling/removal algorithm of a speakerphone device and am trying to get ideas on how to improve the performance.
I know physics is against me, but maybe you have an idea how to work around the dominating physical problems.
Here is the problem:
Within our device there is a speaker that can be very loud. The far end of a telecommunication is output by this speaker. In addition there is a microphone insanely close to this speaker that has to pick up the near-end signal of the communication (a person speaking from roughly 1 m away).
Both the speaker and the mic work very well on their own. The problem is - of course - that there is a very strong disturbance (echo) in the mic signal caused by the speaker especially if it is cranked up to max volume.
I apply a nice and lean echo-cancelling algorithm that removes about 15-20 dB of the echo on the mic channel, which is about as good as it gets according to my cancel-a-wide-band-signal experience (wrong assumption?). Unfortunately the mic is about 1 m from the speaking person, so its signal needs a lot of gain, and 20 dB of echo suppression is just not good enough.
So in addition to cancelling the echo I need to duck the mic signal when the speaker is active. If only the speaker is active this works fine. My problem is in double-talk situations. When the far end speaks, the near end is ducked and cannot interrupt.
So I need to detect double talk (both ends speaking) and remove some of my ducking in these situations. I found suggestions on the net to detect double talk by comparing the mic signal energy before and after the echo cancelling. If the energy is a lot smaller afterwards, the signal was obviously dominated by echo (the far end is speaking, so the echo canceller had work to do). If it doesn't go down as much, the mic is picking up both echo and something else (double talk). And if it doesn't go down at all, there is no echo (near end only, or no one speaking).
This double talk detection works fine if the speaker signal is only reasonably loud - but since echo-cancelling can only remove 20dB of the echo, this detection fails as soon as the echo is more than 20dB louder at the mic than the near end signal (which can happen since the loudspeaker is very close to the mic). In that case the detection above will always detect 'far-end-speaking' even if the near end speaks as well.
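The energy-comparison detector described above can be sketched in a few lines. This is a generic illustration, not code from the thread: the frame-based structure, the `far_active` flag, and the 10 dB threshold are my assumptions.

```python
import numpy as np

def classify_frame(mic, ec_out, far_active,
                   echo_thresh_db=10.0, noise_floor=1e-10):
    """Classify one frame as 'far', 'double', or 'near/idle'.

    mic        : mic samples before echo cancellation
    ec_out     : the same frame after echo cancellation
    far_active : bool, far-end speech present on the loudspeaker signal
    The threshold is an illustrative starting point, not a tuned value.
    """
    e_mic = np.dot(mic, mic) + noise_floor
    e_out = np.dot(ec_out, ec_out) + noise_floor
    erle_db = 10.0 * np.log10(e_mic / e_out)  # how much the EC removed

    if not far_active:
        return "near/idle"        # no echo possible without far-end speech
    if erle_db > echo_thresh_db:
        return "far"              # frame dominated by echo
    return "double"               # echo present, but something survived it

# a frame where the canceller removed ~40 dB -> classified as far end only
frame = np.ones(160)
print(classify_frame(frame, 0.01 * frame, far_active=True))  # -> far
```

As noted below, this breaks down exactly when the residual echo is still louder than the near-end talker, so the threshold alone cannot save it.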
This is what I want to have, but don't quite know how to get:
- Somehow detect there is near end activity even if the speaker echo is very loud.
- Somehow get just a little bit of this near end signal through to the far end to give the idea of someone interrupting to the far end
- without getting echo through to the far end
In the field of acoustics, there is a concept of T60 time. This is how long an impulse takes to decay by 60 dB as a result of the reverberation within an acoustic space. For an echo canceler, you don't need to reach the T60 time, but you can choose your desired attenuation in dB, and then your filter must span at least that much of the decay. For example, you could choose a 45 dB decay time and span that much.
Your ability to cancel an echo within that acoustic space will be directly related to how much of the decay time of the acoustic space you have covered. If you are only achieving 20 dB cancellation, and the environment (including the hardware) is low noise and linear, then you have most likely not covered the span of the acoustic space needed.
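That span requirement is simple arithmetic: if the room decays roughly linearly in dB, the time to reach D dB of decay is about T60 * D / 60, and the canceller tail must cover that many samples. A sketch (the T60 and sample-rate values in the example are illustrative, not measurements from this thread):

```python
def required_taps(t60_s, target_db, fs_hz):
    """Adaptive filter taps needed to cover the first `target_db` of decay,
    assuming an idealized linear decay of 60 dB over t60_s seconds."""
    decay_time_s = t60_s * target_db / 60.0
    return int(round(decay_time_s * fs_hz))

# e.g. a 300 ms room, a 45 dB target, 16 kHz sampling:
print(required_taps(0.300, 45.0, 16000))  # -> 3600 taps
```

Filter lengths in the thousands of taps are why the compute argument below matters.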
In another comment here, you indicate that you are constrained by compute power. The most common solution to this issue is to implement the cancellation in sub-bands. The reduction in compute power (compared to a straight NLMS adaptive EC) is directly proportional to the number of sub-bands into which you divide the frequency space.
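The proportionality claim can be made concrete with rough multiply-accumulate counts. The numbers below (tap count, processed-band count, decimation factor) are illustrative, and the sketch deliberately ignores the filter-bank overhead and the extra cost of complex arithmetic:

```python
def fullband_macs_per_sample(n_taps):
    # FIR filtering plus the NLMS weight update: ~2 MACs per tap
    return 2 * n_taps

def subband_macs_per_sample(n_taps, n_bands, decim):
    # each band models a decimated impulse response (n_taps / decim taps)
    # and is only updated once every `decim` input samples
    return 2 * (n_taps / decim) * n_bands / decim

full = fullband_macs_per_sample(3600)        # e.g. a 45 dB span at 16 kHz
sub = subband_macs_per_sample(3600, 23, 40)  # 23 bands, decimate by 40
print(full, sub)  # -> 7200 103.5
```

Even after adding the analysis/synthesis filter banks back in, the sub-band structure wins by a wide margin at these filter lengths.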
There are lots of other reasons to go to sub-bands as well.
One important reason is that you can distribute your processing power where it is needed. Reverberation time is much longer for low frequencies than for those in the higher range. Since you need more reverb span at the lower frequencies to reach your desired cancellation, you can allocate more taps to the low frequencies than to the higher ones.
Another is the ability to apply residual suppression on a frequency selective basis.
Yet another is the ability to attack the double-talk problem in a divide-and-conquer manner.
Anyway, you should be shooting for at least 45 dB of cancellation in a clean linear environment, before you even apply residual suppression.
In one speakerphone we did at Bell Labs in the mid '90s, we started out by making sure that the system operated as a half duplex device first, and then as the canceller learned over time, we slowly reduced the amount of supplemental suppression. This kept everything stable and generally pleasant for the user experience.
Thanks for the hints!
HOW would you divide the spectrum into sub-bands? I tried MDCT, but the results were poorer than with the NLMS adaptive filter. I also tried a block-based FFT approach that also performed worse - maybe I just made a mistake... What would be your weapon of choice to perform sub-band EC? Your mention of reduced processing power tells me they cannot be fully sampled sub-bands.
"Another is the ability to apply residual suppression on a frequency selective basis."
I tried that (with MDCT again), but it produced many artifacts. To smooth them out I had to attenuate neighbouring bands as well, which largely eliminated the benefit of the frequency-selective approach.
Do you know any good papers for loudspeaker linearization?
The sub-band separation requires polyphase filtering and a DFT.
In our case, we were processing 8ksps data for 3.4 kHz audio, and we used a very odd number selection. We had data coming in 5 msec chunks (40 samples). We used 56 sub-bands of which only 23 complex needed to be processed, because we ignored close to DC and half sampling rate. Our ratio was therefore 56/40 oversampling for the polyphase separation. In our case, we brute forced the 23 point DFT, because it was easy.
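The polyphase-plus-DFT analysis step looks roughly like the sketch below. This is a generic oversampled uniform DFT filter bank, not the Bell Labs design: the prototype length multiple `K` and the crude windowed-sinc prototype are my assumptions, and a production filter bank needs a properly optimized prototype.

```python
import numpy as np

M, R = 56, 40  # bands and hop from the post (56/40 oversampled)
K = 4          # prototype length in multiples of M -- my assumption
L = K * M

# crude windowed-sinc lowpass prototype with cutoff pi/M;
# a real design would optimize this carefully
n = np.arange(L) - (L - 1) / 2
h = np.sinc(n / M) * np.hamming(L)
h /= h.sum()

def analyze(history):
    """One analysis step of a uniform DFT filter bank.
    history: the most recent L input samples, newest last.
    Returns M complex sub-band samples (only ~half are unique
    for real input, which is where the '23 complex' comes from)."""
    u = h * history[::-1]                 # window the time-reversed history
    folded = u.reshape(K, M).sum(axis=0)  # polyphase fold down to length M
    return np.fft.fft(folded)             # M-point DFT across the bands

# stream usage: advance the history by R (= 40) samples per step
x = np.random.default_rng(0).standard_normal(2000)
hist = np.zeros(L)
frames = []
for s in range(0, len(x) - R + 1, R):
    hist = np.concatenate([hist[R:], x[s:s + R]])
    frames.append(analyze(hist))
```

The fold-then-FFT trick works because complex exponentials at the band centers are periodic in M samples, so the full L-point modulated sum collapses to an M-point DFT.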
The two best references I know for sub-band separation are Vaidyanathan's and fred harris' books on filter bank design. (@fred, notice I preserved the case :-))
On the topic of suppression, you are right about the artifacts. Great care is required in the choice of time constants for the gain adjustments in the sub-bands and the rate of attack and decay. This is more art than science. I don't know what else to say but to experiment.
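One common starting point for those time constants is per-band one-pole smoothing of the suppression gains with separate attack and release rates, so gain reductions land fast but recover slowly. A hedged sketch (the coefficients are arbitrary starting values for experimentation, not recommendations):

```python
import numpy as np

def smooth_gains(target, prev, attack=0.5, release=0.9):
    """One-pole attack/release smoothing of per-band suppression gains.

    target : raw gains computed for this frame (one per sub-band)
    prev   : smoothed gains from the previous frame
    A smaller coefficient moves faster toward the target; here gain
    drops (attack) are faster than gain recoveries (release).
    """
    out = np.empty_like(target)
    falling = target < prev  # gain being pulled down -> fast attack
    out[falling] = attack * prev[falling] + (1 - attack) * target[falling]
    out[~falling] = release * prev[~falling] + (1 - release) * target[~falling]
    return out
```

As the answer says, the actual values are more art than science; this just gives the knobs a shape to experiment with.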
Sorry, my week-end got in the way :) Thanks again for the hints. I will look into the sub-band stuff...
The "double talk detection" approach is not that appealing to me; it looks like a workaround and does not address the very nature of the problem.
-15/-20dB is not what I expect from a "nice and lean echo-cancelling algorithm", so I would like to know more.
Personally I think that either the algorithm is not powerful enough (what algorithm did you use, specifically?), or there is something in the boundary conditions that is preventing it from working well, and that would call for analysis.
Furthermore, you did not mention what processing power you have available, nor what portion of the processing power is currently used by this echo cancelling algorithm.
My ideal solution does NOT care about what's happening in the double talk condition, and removes probably some 40-45 dB, which should be fine.
Of course I rely on linearity (h/w), or on linearity compensation (s/w) for this.
Well, maybe it is more lean than nice :) Basically it is just an adaptive filter. Its length is limited by processing power, which isn't much, but in simulations the performance didn't get much better with longer filters. I think the length allows me to cancel the direct path and the first reflections for wide-band signals.
My experience with cancelling algorithms in the past was that you usually don't get much more than 20 dB in real-world wide-band applications. Your filter will either be too short to match the physics exactly, or you make it longer and the adaptation error bound will rise and/or your adaptation will become very slow. For narrow-band signals I have had much higher suppression rates (50 dB+), but unfortunately I need to suppress 50 Hz-8 kHz, which is a bunch of octaves.
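For reference, the "plain adaptive filter" under discussion, in its NLMS form, looks roughly like this. It is a generic sketch under ideal linear, noise-free conditions, not the poster's implementation; the tap count and step size are illustrative.

```python
import numpy as np

def nlms_cancel(far, mic, n_taps=256, mu=0.5, eps=1e-6):
    """Plain NLMS echo canceller (generic sketch, not the poster's code).

    far : loudspeaker reference signal
    mic : microphone signal (echo + near end)
    Returns the error signal: mic with the estimated echo removed.
    """
    w = np.zeros(n_taps)       # adaptive echo-path estimate
    x = np.zeros(n_taps)       # reference delay line, newest sample first
    e = np.empty(len(mic))
    for n in range(len(mic)):
        x[1:] = x[:-1]
        x[0] = far[n]
        y = w @ x              # echo estimate for this sample
        e[n] = mic[n] - y
        w += mu * e[n] * x / (x @ x + eps)  # normalized LMS update
    return e

# toy check: echo is the far signal through a short, perfectly linear FIR path
rng = np.random.default_rng(1)
far = rng.standard_normal(8000)
path = np.array([0.5, 0.3, -0.2])
mic = np.convolve(far, path)[:len(far)]
e = nlms_cancel(far, mic, n_taps=8)
# after convergence the residual sits far below the echo level -- which is
# exactly what real rooms, long tails, and nonlinearity take away from you
```

In this idealized setting the residual goes way past 40 dB down; the gap to real hardware is the whole point of the linearity discussion below.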
Again, this experience may be wrong, so I am very curious: did you ever implement acoustic echo cancellation that removed 40-45 dB of the echo in real life (not in simulation, I mean)?
And yes, linearity is a serious problem as the speaker gets louder - and it is also not that easy to compensate, since (to my knowledge) the speaker non-linearity cannot be captured by low-order non-linear models.
Well yes, I got 40-45 dB attenuation (different performances at different frequencies) but there was a lot of effort on linearity compensation.
I guess that the non-linearity issue is probably the cause of the difference in performance, since my algorithm was a nice adaptive filter as well - no magic...
But simulation should tell you if this is the case.
I'm not in that business, but if you introduced a (recognizable but nearly inaudible) pattern into the voice signal, maybe you could detect that pattern in the echo (near and far), distinguish near from far, and even calculate the near/far echo levels and their ratio, and so find out how the signal is composed.
As I said, just an idea - and maybe not feasible.