Components in Audio recognition - Part 1
Audio recognition is the task of identifying a particular piece of audio (music, a ring-tone, or speech) from a given set of reference audio tracks.
The Human Auditory System (HAS) is remarkable in that becoming "familiar" with unknown tracks and finding "similar" tracks come naturally to us. Tunes from the not-so-recent past can still haunt the human brain many years later when triggered by a similar tune. The way the brain stores and responds to music has been shown to differ from the way it processes speech and other stimuli. The field of audio recognition tries to emulate this behaviour using concepts from biological modelling, signal processing theory and pattern recognition theory.
Audio recognition systems are used mainly to retrieve similar tracks from a database, for purposes such as copyright management and personal playlist management. A very different class of system based on "social rating" also exists, which relies on peer ratings of media files to decide where they belong. That approach is not covered in this series, but will be compared against where relevant.
A typical audio recognition system consists of the following components.
- A system that "stores" the archive of tracks that need to be managed. This could be a simple SQL database indexing files stored in a 100 TB server.
- A system that "analyses" the archive and fingerprints the characteristics of each track, and form various "groups" or "sets" of track based on their overlapping characteristics. This will typically include components from modeling, signal processing, and pattern recognition fields.
- A system that can "receive" a audio track that needs to be "placed" into one of the many given groups or sets. This is typically a front-end, that is an User-Interface of some kind, followed by more Signal Processing blocks.
Portable implementations of the above can be built with smaller storage and more efficient, but more limited, analysis capabilities and front ends. These can, for example, be used in portable media players; the Rio Volt had an early implementation of such an interface.
In the next articles in this series, we will see how each of these components is typically implemented. We will also look at some reference implementations and discuss why one approach is better or worse than another. If you have any specific topic you would like discussed, email me at prabindh a't yahoo a't com.
For those of you looking for a place to start your scholarly searches, start at http://www.music-ir.org/
Looking forward to your feedback,
Prabindh
Comments
I am working on implementing a speech codec (ITU-based). Can you help me with how to proceed and start in the correct direction?