
Post-doc in Acoustic Event Detection (US citizens only)

Started by Unknown August 24, 2006
tony.nospam@nospam.tonyRobinson.com wrote:
> "Rune Allnor" <allnor@tele.ntnu.no> writes:
>
> > You've lost me.
>
> I think that's because I'm discussing the meaning of a job ad and you're
> discussing a time in your life when you had some very bad experiences.
The reason why I got those very bad experiences was that I, before I signed on to the various jobs, didn't pay attention to the kind of details we are discussing regarding this job ad: "Did [my later to become boss] really understand the subject we discussed at the interview?" The answer turned out to be "No!" with very bad consequences for me. I was the person assigned to implement what a layman had specified in as roundabout terms as in this job ad; the customer expected that contract to be fulfilled verbatim -- I took all the impact when the project blew up.

The next job: I read the job description, which was OK in general but a bit hazy on the subjects I was assigned to work with. All the hardware and experimental stuff was no problem; competent people handled that. "These people can't really be *that* incompetent on processing" I thought, "the processing is the key here so there are probably confidentiality issues here." and signed. No, no such confidentiality existed. Yes, the people really did not have the faintest clue as to how to process the data. Again, the project blew sky high and I took the impact from the customer.

As I have said numerous times: This job ad carries all the indicators of the people behind it paying as much attention to detail and planning as the people that got me into serious trouble.

Rune
tony.nospam@nospam.tonyRobinson.com wrote:
> "Rune Allnor" <allnor@tele.ntnu.no> writes:
>
> > I am guessing now -- I can't read the ad's author's mind so I don't
> > know what he or she intended to say -- but it might have been better
> > to write something like "match recorded speech contaminated by
> > noise to a multilingual database" [*]. It might cover the bases, and if
> > it does, it would let any serious applicant see that the author knows
> > his business. It would also cover the institution for glitches:
> > A "multilingual database" has a finite extent, depending on the languages
> > of interest and available resources in general. One miss may be
> > explained by whatever target not being covered by the database.
> > Which is not at all an unlikely occurrence.
> >
> > "All possible words" leaves no such room for error, opening for
> > all sorts of onslaughts by a customer, should a miss ever occur.
>
> So your problem is one of coverage? I see no claim that the target has
> to occur in any database.
So how are you going to find it?
> Just to clarify, this is what we are talking about "the detection of one
> particular target word and/or sound in the background of all other
> possible words or any other realistic sounds."
>
> Let's set this up as a conventional pattern matching problem. We have
> an unknown pattern, A, and a database of previously seen patterns. We
> have a function that returns a score between A and every item in the
> database. Are you comfortable with this framework?
Yes.
> We'll represent all patterns as digital waveforms with sufficient
> resolution and sampling frequency to reasonably represent any sound.
> Are we okay to this point or do you want to argue that you can't
> represent every possible sound in such a digital format?
To within sampling bandwidth and quantization effects, yes.
> Now we populate our database with sounds, it might be a parrot saying "I
> know what Ambiguous means!", a computer synthesising "Usenet ranting is
> a waste of time" or anything else. The patterns can be any sound which
> covers all possible words. Are you okay with this or would you like to
> argue that you'll need an infinite database to store all possible words?
The signal can be any sound, then. Whether it is a word or anything else is unspecified, as far as I am concerned.
> We run our matching algorithm, it comes up with a result, perhaps it's
> good, perhaps not. Are you happy with this or do you expect perfect
> detection?
I can't see what relevance this matching has to anything useful. Any matching algorithm will produce a result on any data. Whether the match makes sense or is useful depends on whether the processing algorithm is based on a representation that matches the data.

I used to work with frequency estimators like MUSIC and ESPRIT. They pop out answers to numerical precision where standard Fourier transforms have quantized resolution. The frequency estimators worked very well AS LONG AS the data were reasonably well represented by the sum-of-sines model the mentioned frequency estimators were based on.

Would you like to guess what happened when I tried the same methods with not-so-model-compliant data? The frequency estimators still popped out answers to within numerical precision. The only snag was that the numbers they produced had no relation whatsoever to the spectra as computed with the Fourier transform.

Now, the key here is that there was nothing in the numbers the frequency estimators produced that warned me about a model that did not comply with the data. The only reason why I learned of this behaviour was that I expected such an effect to happen, and compared the estimates to DFT data.

What makes you think you can find such an elusive signal structure as a "word" when DSP methods that have been researched extensively for 30 years cannot tell whether or not a signal segment really contains such a well-defined structure as a monochromatic sinusoid?
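The behaviour described here is easy to reproduce. What follows is a minimal sketch of a least-squares ESPRIT estimator in Python/NumPy, not the code used in the original experiments; the function name `esprit_freqs`, the window length `m` and both test signals are assumptions chosen for illustration. It is run once on two noisy sinusoids, which fit the sum-of-sines model, and once on pure white noise, which does not.

```python
import numpy as np

def esprit_freqs(x, n_sines, m=20):
    """Least-squares ESPRIT: estimate n_sines frequencies in cycles/sample."""
    # Stack overlapping m-sample windows as the columns of a data matrix.
    windows = np.array([x[i:i + m] for i in range(len(x) - m + 1)]).T
    u, _, _ = np.linalg.svd(windows, full_matrices=False)
    p = 2 * n_sines                      # a real sinusoid is two complex exponentials
    us = u[:, :p]                        # assumed "signal subspace"
    # Shift invariance: us[1:] ~= us[:-1] @ phi, solved in the least-squares sense.
    phi = np.linalg.lstsq(us[:-1], us[1:], rcond=None)[0]
    angles = np.angle(np.linalg.eigvals(phi))
    freqs = np.sort(np.abs(angles)) / (2 * np.pi)
    return freqs[::2]                    # keep one value per conjugate pair

rng = np.random.default_rng(0)
n = np.arange(200)

# Data that fit the model: two sinusoids plus a little white noise.
sines = np.sin(2 * np.pi * 0.10 * n) + 0.5 * np.sin(2 * np.pi * 0.23 * n + 0.7)
print("model-compliant data:", esprit_freqs(sines + 0.01 * rng.standard_normal(200), 2))

# Data that do not fit the model: white noise only.
print("white noise only:    ", esprit_freqs(rng.standard_normal(200), 2))
```

Both calls return two frequencies in the same format. Nothing in the return value says which pair is meaningful, so a cross-check against a DFT of the same data is still needed.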
> We now degrade the database by embedding the patterns in other patterns.
> No doubt the accuracy will get worse. Do you have a problem with that?
No.
> Sure, the difficulty of the problem depends on the signal to noise ratio
> in the database, as it's unspecified you could argue that the task is
> impossible - do you want to do that?
Signal to noise ratio is just a minor issue here. SNR studies usually concern themselves with white noise. That will be an issue here, but it will not be significant compared to some of the other stuff.

What you are proposing is to detect one "wanted" sound -- which seems to be more or less well specified -- among a set of "unwanted" sounds that bear similar characteristics to the "wanted" sound. This seems to be very similar to what is otherwise known as "signal-correlated noise". It's usually a killer for most systems' performance.

In SONAR, which I am familiar with, it is no use trying to overcome reverberation by increasing the source level. Doing that only increases the reverberation as well, so you are stuck. It's a trivial argument.
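The source-level argument can be put in numbers. The sketch below assumes a deliberately simple power model in which the echo and the reverberation both scale linearly with source power while ambient noise stays fixed. The function name and the three constants are made-up values for illustration, not measurements from any real sonar.

```python
import math

def signal_to_interference_db(source_power, echo_gain=1e-6,
                              reverb_gain=1e-5, ambient_noise=1e-3):
    """Echo power over (reverberation + ambient noise), in dB.

    All gains and the noise level are hypothetical illustration values.
    """
    echo = echo_gain * source_power       # wanted return scales with the source
    reverb = reverb_gain * source_power   # and so does the reverberation
    return 10 * math.log10(echo / (reverb + ambient_noise))

# Ceiling set by the ratio of the two gains, independent of source power.
ceiling_db = 10 * math.log10(1e-6 / 1e-5)

for power in (1.0, 1e3, 1e6, 1e9):
    print(f"source power {power:>8.0e}: {signal_to_interference_db(power):8.3f} dB "
          f"(ceiling {ceiling_db:.1f} dB)")
```

Raising the source power helps only while ambient noise dominates. Once reverberation dominates, the ratio levels off at the ceiling fixed by the two gains.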
> My interpretation of the job ad is to build a better pattern matching
> algorithm so that you can find one sound in the background of other
> sounds. Do we agree?
Seems to be similar to that all-time popular task of extracting one instrument from a recording of an orchestra. All people want to achieve in that setting is to separate one melody or line of tunes from some others.

I don't know that anybody has achieved it.
> You've clearly got a problem matching one sound against all other
> possible sounds based on your past experience, I just don't see how that
> relates to this job ad.
OK, define a "word". Does a human have to utter the sound, or is it sufficient that the parrot utters it? Is everything a human utters a "word"? Having defined the "word" somehow, how do you separate a "word" in a recording from a mixture involving other types of sounds?

Apart from that, you might find it interesting to read up on Weierstrass' (sp?) representation theorem. It basically says that any data sequence can be represented arbitrarily well by any set of linearly independent basis functions.

So basically, you can take any signal and match it to any of the sounds in your library. Subtract the template sound that matches best, and remove the template from the library. Then match the remains of the signal against the remains of the library. Some other sound will be the one that matches best. Subtract this from the signal and remove the template from the library. Repeat this match - subtract - remove process until either the residual == 0 or you have compared the signal with the whole library. I'll almost guarantee that you still have a non-zero residual by the time you run out of templates.

The Weierstrass theorem is a real killer for most "bright" signal analysis ideas based on template matching.

Rune
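The match - subtract - remove loop can be sketched in a few lines of Python/NumPy. This is only an illustration under stated assumptions: the post does not say how the subtraction is scaled, so the sketch subtracts the least-squares projection onto the best template, and the function name `greedy_match`, the random library and the random signal are all hypothetical.

```python
import numpy as np

def greedy_match(signal, library):
    """Match, subtract, remove until the residual is zero or the library is empty.

    library is a 2-D array with one template per row. Returns the list of
    (template index, residual norm after subtraction) and the final residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    remaining = list(range(len(library)))
    history = []
    while remaining and np.linalg.norm(residual) > 1e-12:
        # Score each remaining template by normalised correlation with the residual.
        scores = [abs(residual @ library[k]) / np.linalg.norm(library[k])
                  for k in remaining]
        best = remaining[int(np.argmax(scores))]
        template = library[best]
        # Assumption: subtract the scaled template (projection), not the raw one.
        gain = (residual @ template) / (template @ template)
        residual = residual - gain * template
        remaining.remove(best)
        history.append((best, float(np.linalg.norm(residual))))
    return history, residual

rng = np.random.default_rng(0)
library = rng.standard_normal((8, 64))   # 8 unrelated templates, 64 samples each
signal = rng.standard_normal(64)         # a signal that contains none of them

history, residual = greedy_match(signal, library)
print("initial norm:", np.linalg.norm(signal))
for index, norm in history:
    print(f"matched template {index}, residual norm {norm:.3f}")
```

Every template "matches" to some degree and removes some energy, yet the residual is still non-zero when the library is exhausted. The residual is only guaranteed to reach zero if the templates span the signal space, and eight templates cannot span a 64-sample signal.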
tony.nospam@nospam.tonyRobinson.com wrote:

   ...

> Now we populate our database with sounds, it might be a parrot saying "I
> know what Ambiguous means!", a computer synthesising "Usenet ranting is
> a waste of time" or anything else. The patterns can be any sound which
> covers all possible words. Are you okay with this or would you like to
> argue that you'll need an infinite database to store all possible words?
The ad reads "... detection of one particular target word and/or sound in the background of all other possible words or any other realistic sounds." "All other possible words or any other realistic sounds" (including machine-gun fire, stampeding cattle, train wrecks and more) sounds pretty infinite to me. Whoever wrote that cannot possibly mean it. Giving the author the benefit of good will, we must conclude careless expression as the least possible fault.

...

Jerry
--
Engineering is the art of making what you want from things you can get.
> Seems to be similar to that all-time popular task of extracting one
> instrument from a recording of an orchestra. All people want to
> achieve in that setting is to separate one melody or line of tunes
> from some others.
>
> I don't know that anybody has achieved it.
Rune,

In our task the "background" words and noises are interspersed with the target word, not overlapping with it. Using the word "background" to refer to interspersed signals is common in the keyword spotting field. Since this ad was aimed at a broader audience than that field, we should have clarified our language. I have to take some responsibility for not catching this before posting and I sincerely apologize for the confusion.

Regards,
David
"Rune Allnor" <allnor@tele.ntnu.no> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
> > "Rune Allnor" <allnor@tele.ntnu.no> writes:
> > > You've lost me.
> >
> > I think that's because I'm discussing the meaning of a job ad and you're
> > discussing a time in your life when you had some very bad experiences.
>
> The reason why I got those very bad experiences was that I, before
> I signed on to the various jobs, didn't pay attention to the kind of
> details we are discussing regarding this job ad: "Did [my later to
> become boss] really understand the subject we discussed at the
> interview?" The answer turned out to be "No!" with very bad
> consequences for me. I was the person assigned to implement
> what a layman had specified in as roundabout terms as in this job ad;
> the customer expected that contract to be fulfilled verbatim -- I took
> all the impact when the project blew up.
>
> The next job: I read the job description that was OK in general,
> but a bit hazy in the subjects I was assigned to work with. All the
> hardware and experimental stuff was no problem; competent people
> handled that. "These people can't really be *that* incompetent on
> processing" I thought, "the processing is the key here so there are
> probably confidentiality issues here." and signed. No, no such
> confidentiality existed. Yes, the people really did not have the
> faintest clue as to how to process the data. Again, the project
> blew sky high and I took the impact from the customer.
>
> As I have said numerous times: This job ad carries all the
> indicators of the people behind it paying as much attention
> to detail and planning as the people that got me into serious
> trouble.
Sure, if you are signing a legal agreement you've got to understand what you are signing. However, this job ad isn't a legal agreement and we don't know the details of employment. Your objection seems mostly to be based on the premise that the employer has a task that can't be realistically accomplished within the project and that the hired researcher will take the can. I see no evidence for this position. Indeed, from what I know of ICSI I find this position very very unlikely. Tony
"Rune Allnor" <allnor@tele.ntnu.no> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
> > "Rune Allnor" <allnor@tele.ntnu.no> writes:
> >
> > > I am guessing now -- I can't read the ad's author's mind so I don't
> > > know what he or she intended to say -- but it might have been better
> > > to write something like "match recorded speech contaminated by
> > > noise to a multilingual database" [*]. It might cover the bases, and if
> > > it does, it would let any serious applicant see that the author knows
> > > his business. It would also cover the institution for glitches:
> > > A "multilingual database" has a finite extent, depending on the languages
> > > of interest and available resources in general. One miss may be
> > > explained by whatever target not being covered by the database.
> > > Which is not at all an unlikely occurrence.
> > >
> > > "All possible words" leaves no such room for error, opening for
> > > all sorts of onslaughts by a customer, should a miss ever occur.
> >
> > So your problem is one of coverage? I see no claim that the target has
> > to occur in any database.
>
> So how are you going to find it?
Who says that you are going to? I think it perfectly reasonable that you output a confidence that the target word appears in every database entry.
> > Now we populate our database with sounds, it might be a parrot saying "I
> > know what Ambiguous means!", a computer synthesising "Usenet ranting is
> > a waste of time" or anything else. The patterns can be any sound which
> > covers all possible words. Are you okay with this or would you like to
> > argue that you'll need an infinite database to store all possible words?
>
> The signal can be any sound, then. Whether it is a word or anything
> else is unspecified, as far as I am concerned.
Yes, I've generalised to any sound so that you're not concerned with the complexities of some language that is not available to the researcher. Let's keep things simple if we can.
> > We run our matching algorithm, it comes up with a result, perhaps it's
> > good, perhaps not. Are you happy with this or do you expect perfect
> > detection?
>
> I can't see what relevance this matching has to anything useful.
> Any matching algorithm will produce a result on any data. Whether
> the match makes sense or is useful depends on whether the
> processing algorithm is based on a representation that
> matches the data.
True enough. I'm not arguing for any particular algorithm, just that it's reasonable to allow for any sound to be the target and any sound to be in the database as background. This seemed to be one of your main objections.
> What makes you think you can find such an elusive
> signal structure as a "word" when DSP methods that
> have been researched extensively for 30 years cannot
> tell whether or not a signal segment really contains such
> a well-defined structure as a monochromatic sinusoid?
It's easy to add sufficient noise to any pattern matching problem such that the probability of detection becomes vanishingly small. This is as true of a target word in background noise as it is of a single sinusoid in noise. This just makes it a harder task and more interesting for some people - not an irresponsibly worded advert or indication of evil employers out to depress new PhDs into abandoning their careers.
> > You've clearly got a problem matching one sound against all other
> > possible sounds based on your past experience, I just don't see how that
> > relates to this job ad.
>
> OK, define a "word". Does a human have to utter the sound, or is it
> sufficient that the parrot utters it? Is everything a human utters a
> "word"?
> Having defined the "word" somehow, how do you separate a "word" in a
> recording from a mixture involving other types of sounds?
This is just irrelevant detail, I still don't know why you are hung up on it. We don't know the details of the project but it's reasonable to assume we've got an instance of the target "word", so it's just a set of acoustic vectors. We can build a model of the variation of these that we may reasonably see in the database. We can build models of the background noise in the database. We can combine both models and perform a match.

If you want details, one representation would be to use FFT powers binned on a perceptual scale and hidden Markov models for the target and background. The target and background models can be combined by multiplying out the states (parallel model combination), and assuming the target and background are uncorrelated, the observation powers sum, as do the variances. You can then do a Viterbi alignment and look for the log likelihood difference between your target word occurring and not occurring.

I'm not saying this is the best method, just it's the first that comes to mind, and perhaps the one I'd start with if I was a contractor tasked with this, or if I was supervising a research student/research fellow in this area.
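The last step of that recipe, a Viterbi score with and without the target, can be sketched on synthetic data. The sketch below makes several simplifying assumptions that are not in the post: the features are made-up Gaussian vectors rather than binned FFT powers, the target is interspersed with the background rather than overlapping it (so no parallel model combination is needed), and all names, means and transition probabilities are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Diagonal-Gaussian log densities: x is (T, D), mean/var are (S, D) -> (T, S)."""
    d = x[:, None, :] - mean[None, :, :]
    return -0.5 * np.sum(d * d / var + np.log(2 * np.pi * var), axis=2)

def viterbi_score(log_b, log_pi, log_a, final_states):
    """Log probability of the best state path that ends in one of final_states."""
    delta = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        delta = np.max(delta[:, None] + log_a, axis=0) + log_b[t]
    return float(np.max(delta[final_states]))

def keyword_llr(x, target_means, bg_mean, var=1.0, p_bg_stay=0.9, p_tgt_stay=0.5):
    """Viterbi log-likelihood difference: target present minus background only."""
    k = len(target_means)
    s = k + 2                                  # background, k target states, background
    means = np.vstack([bg_mean, target_means, bg_mean])
    log_b = log_gauss(x, means, np.full_like(means, var))
    log_pi = np.full(s, -np.inf)
    log_pi[0] = 0.0                            # always start in the background
    log_a = np.full((s, s), -np.inf)
    log_a[0, 0] = np.log(p_bg_stay)
    log_a[0, 1] = np.log(1.0 - p_bg_stay)
    for i in range(1, k + 1):                  # left-to-right target states
        log_a[i, i] = np.log(p_tgt_stay)
        log_a[i, i + 1] = np.log(1.0 - p_tgt_stay)
    log_a[s - 1, s - 1] = 0.0                  # trailing background absorbs the rest
    with_target = viterbi_score(log_b, log_pi, log_a, [k, k + 1])
    background_only = float(np.sum(log_b[:, 0]))
    return with_target - background_only

rng = np.random.default_rng(1)
target_means = 4.0 * np.eye(3)                 # three hypothetical target states
bg_mean = np.zeros(3)
background = rng.standard_normal((55, 3))
with_word = background.copy()
with_word[20:35] += np.repeat(target_means, 5, axis=0)   # five frames per state

print("LLR, target present:", keyword_llr(with_word, target_means, bg_mean))
print("LLR, background only:", keyword_llr(background, target_means, bg_mean))
```

The output is a score rather than a yes/no answer. Thresholding it trades missed detections against false alarms, and how well that works depends on how well the two models fit the data.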
> Apart from that, you might find it interesting to read up on
> Weierstrass' (sp?) representation theorem. It basically says that
> any data sequence can be represented arbitrarily well by any
> set of linearly independent basis functions.
Sounds perfectly reasonable to me.
> So basically, you can take any signal and match it to any
> of your sounds in your library. Subtract the template sound that
> matches best, and remove the template from the library. Then
> match the remains of the signal against the remains of the library.
> Some other sound will be the one that matches best. Subtract
> this from the signal and remove the template from the library.
> Repeat this match - subtract - remove process until either the
> residual == 0 or you have compared the signal with the whole
> library. I'll almost guarantee that you still have a non-zero residual
> by the time you run out of templates.
>
> The Weierstrass theorem is a real killer for most "bright"
> signal analysis ideas based on template matching.
Well, in this case I wouldn't expect the target and database entries to be in phase, so simple subtraction wouldn't get you anywhere. But your point seems to be that you can match any target to any template to a certain degree, and from this you draw the conclusion that the task is impossible. That's not a valid step: all that we need to do is establish a degree of confidence that the target is embedded in the background, and that is certainly possible to do.

Tony
Jerry Avins <jya@ieee.org> writes:

> tony.nospam@nospam.tonyRobinson.com wrote:
>
> ...
>
> > Now we populate our database with sounds, it might be a parrot saying "I
> > know what Ambiguous means!", a computer synthesising "Usenet ranting is
> > a waste of time" or anything else. The patterns can be any sound which
> > covers all possible words. Are you okay with this or would you like to
> > argue that you'll need an infinite database to store all possible words?
>
> The ad reads "... detection of one particular target word and/or sound
> in the background of all other possible words or any other realistic
> sounds." "All other possible words or any other realistic sounds"
> (including machine-gun fire, stampeding cattle, train wrecks and more)
> sounds pretty infinite to me. Whoever wrote that cannot possibly mean
> it. Giving the author the benefit of good will, we must conclude
> careless expression as the least possible fault.
Why not? Humans can detect a particular word and/or sound in the background of all other possible words or any other realistic sounds. What's so wrong about researching the machine version? Sure, sometimes the detection rate will be high, sometimes low, and for now it's probably not going to be as good as humans do it - that's no reason for not putting research effort into this field. Tony
Tony Robinson wrote:
> Jerry Avins <jya@ieee.org> writes:
>
>> tony.nospam@nospam.tonyRobinson.com wrote:
>>
>> ...
>>
>>> Now we populate our database with sounds, it might be a parrot saying "I
>>> know what Ambiguous means!", a computer synthesising "Usenet ranting is
>>> a waste of time" or anything else. The patterns can be any sound which
>>> covers all possible words. Are you okay with this or would you like to
>>> argue that you'll need an infinite database to store all possible words?
>> The ad reads "... detection of one particular target word and/or sound
>> in the background of all other possible words or any other realistic
>> sounds." "All other possible words or any other realistic sounds"
>> (including machine-gun fire, stampeding cattle, train wrecks and more)
>> sounds pretty infinite to me. Whoever wrote that cannot possibly mean
>> it. Giving the author the benefit of good will, we must conclude
>> careless expression as the least possible fault.
>
> Why not? Humans can detect a particular word and/or sound in the
> background of all other possible words or any other realistic sounds.
> What's so wrong about researching the machine version? Sure,
> sometimes the detection rate will be high, sometimes low, and for now
> it's probably not going to be as good as humans do it - that's no
> reason for not putting research effort into this field.
Somehow, I didn't think that the ad writer meant to include the sounds of a boiler factory or battlefield in "all realistic sounds", but that may be because I've been trained by experience to expect hyperbole.

Bashful people say "I wouldn't mind if you ..." when they mean "Please ...". I have a jar of mayonnaise whose label declares "Made with the goodness of canola oil". I know how many calories canola oil has. How many are in its goodness? Just about every car ad for the past few years has been the product of trick photography and digital manipulation. I must assume that the idiots who produce them think I'm eager to buy from a master of deception.

Jerry
--
Engineering is the art of making what you want from things you can get.
Jerry Avins <jya@ieee.org> writes:

> Tony Robinson wrote:
> > Jerry Avins <jya@ieee.org> writes:
> >
> >> tony.nospam@nospam.tonyRobinson.com wrote:
> >>
> >> ...
> >>
> >>> Now we populate our database with sounds, it might be a parrot saying "I
> >>> know what Ambiguous means!", a computer synthesising "Usenet ranting is
> >>> a waste of time" or anything else. The patterns can be any sound which
> >>> covers all possible words. Are you okay with this or would you like to
> >>> argue that you'll need an infinite database to store all possible words?
> >> The ad reads "... detection of one particular target word and/or sound
> >> in the background of all other possible words or any other realistic
> >> sounds." "All other possible words or any other realistic sounds"
> >> (including machine-gun fire, stampeding cattle, train wrecks and more)
> >> sounds pretty infinite to me. Whoever wrote that cannot possibly mean
> >> it. Giving the author the benefit of good will, we must conclude
> >> careless expression as the least possible fault.
> > Why not? Humans can detect a particular word and/or sound in the
> > background of all other possible words or any other realistic sounds.
> > What's so wrong about researching the machine version? Sure,
> > sometimes the detection rate will be high, sometimes low, and for now
> > it's probably not going to be as good as humans do it - that's no
> > reason for not putting research effort into this field.
>
> Somehow, I didn't think that the ad writer meant to include the sounds
> of a boiler factory or battlefield in "all realistic sounds", but that
> may be because I've been trained by experience to expect hyperbole.
As I said earlier on, I don't see why it's so impossible to consider a distribution over all possible sounds. I'm by no means advocating that you can give all possible sounds equal weighting, I don't think that defines a meaningful distribution. But there are many possible distributions you can consider - for example it's not impossible for many people to carry around a sound recorder 24 hours a day for say a year and then use that.

There's certainly a lot of cynicism and distrust here, I'm sure that this is unfounded with respect to this job ad.

Tony
Tony Robinson wrote:

   ...

> As I said earlier on, I don't see why it's so impossible to consider a
> distribution over all possible sounds. I'm by no means advocating that
> you can give all possible sounds equal weighting, I don't think that
> defines a meaningful distribution. But there are many possible
> distributions you can consider - for example it's not impossible for
> many people to carry around a sound recorder 24 hours a day for say a
> year and then use that.
As far as I can see, it is not possible to catalog all possible sounds, to say nothing about testing them.
> There's certainly a lot of cynicism and distrust here, I'm sure that
> this is unfounded with respect to this job ad.
No distrust here, only sadness and disgust that supposedly educated people so often fail to avoid ambiguity in what they write even after careful consideration of the text. When the reader is left to fill in gaps or discard untenable interpretations, it becomes very unlikely that writer and reader will see all details in the same light.

Jerry
--
Engineering is the art of making what you want from things you can get.