Towards Efficient and Robust Automatic Speech Recognition: Decoding Techniques and Discriminative Training

Janne Pylkkönen

Automatic speech recognition has been widely studied and is already being applied in everyday use. Nevertheless, the recognition performance is still a bottleneck in many practical applications of large vocabulary continuous speech recognition. Either the recognition speed is not sufficient, or the errors in the recognition result limit the applications. This thesis studies two aspects of speech recognition, decoding and training of acoustic models, to improve speech recognition performance in different conditions.

Biosignal processing challenges in emotion recognition for adaptive learning

Aniket Vartak

User-centered computer-based learning is an emerging field of interdisciplinary research. Research in diverse areas such as psychology, computer science, neuroscience and signal processing is making contributions to take this field to the next level. Learning systems built using contributions from these fields could be used in actual training and education instead of just laboratory proof-of-concept. One of the important advances in this research is the detection and assessment of the cognitive and emotional state of the learner using such systems. This capability moves development beyond the use of traditional user performance metrics to include system intelligence measures that are based on current theories in neuroscience. These advances are of paramount importance in the success and widespread use of learning systems that are automated and intelligent. Emotion is considered an important aspect of how learning occurs, and yet estimating it and making adaptive adjustments are not part of most learning systems. In this research we focus on one specific aspect of constructing an adaptive and intelligent learning system, that is, estimation of the emotion of the learner as he/she is using the automated training system. The challenge starts with the definition of emotion and its utility in human life. The next challenge is to measure the co-varying factors of the emotions in a non-invasive way, and find consistent features from these measures that are valid across a wide population. In this research we use four physiological sensors that are non-invasive, and establish a methodology of utilizing the data from these sensors using different signal processing tools. A validated set of visual stimuli used worldwide in the research of emotion and attention, called the International Affective Picture System (IAPS), is used. A dataset is collected from the sensors in an experiment designed to elicit emotions from these validated visual stimuli.
We describe a novel wavelet method to calculate a hemispheric asymmetry metric using electroencephalography data. This method is tested against the typically used power spectral density method. We show an overall improvement in accuracy in classifying specific emotions using the novel method. We also show distinctions between different discrete emotions from the autonomic nervous system activity using electrocardiography, electrodermal activity and pupil diameter changes. Findings from different features from these sensors are used to give guidelines for using each of the individual sensors in the adaptive learning environment.
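
As a point of reference for the power spectral density baseline mentioned above, the following is a minimal Python sketch of a band-power asymmetry index. It assumes the commonly used definition ln(right-hemisphere band power) minus ln(left-hemisphere band power); the abstract does not state the exact formula used in the thesis, and the function names, the 8-13 Hz band, and the direct DFT are illustrative assumptions, not the author's implementation.

```python
import cmath
import math


def band_power(samples, fs, f_lo, f_hi):
    """Sum the periodogram of `samples` over DFT bins in [f_lo, f_hi] Hz."""
    n = len(samples)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            # Direct DFT of bin k (slow but dependency-free; fine for a sketch).
            coeff = sum(
                s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples)
            )
            power += abs(coeff) ** 2 / n
    return power


def asymmetry_index(left, right, fs, band=(8.0, 13.0)):
    """Assumed metric: ln(right band power) - ln(left band power).

    Both channels must have non-zero power in the band, otherwise the
    logarithm is undefined.
    """
    return math.log(band_power(right, fs, *band)) - math.log(
        band_power(left, fs, *band)
    )
```

A negative index here means more band power on the left channel than on the right; how that maps to a specific emotion is outside the scope of this sketch.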

Fully Programmable LDPC Decoder Hardware Architectures

Christiane Maja Beusche

In recent years, the amount of digital data which is stored and transmitted for private and public usage has increased considerably. To allow safe transmission and storage of data despite error-prone transmission media, error correcting codes are used. A large variety of codes has been developed, and in the past decade low-density parity-check (LDPC) codes, which have excellent error correction performance, have become more and more popular. Today, low-density parity-check codes have been adopted for several standards, and efficient decoder hardware architectures are known for the chosen structured codes. However, the existing decoder designs lack flexibility as only a few structured codes can be decoded with one decoder chip. In consequence, different codes require a redesign of the decoder, and few solutions exist for decoding of codes which are not quasi-cyclic or which are unstructured. In this thesis, three different approaches are presented for the implementation of fully programmable LDPC decoders which can decode arbitrary LDPC codes. As a design study, the first programmable decoder, which uses a heuristic mapping algorithm, is realized on a field-programmable gate array (FPGA), and error correction curves are measured to verify the correct functionality. The main contribution of this thesis lies in the development of the second and the third architecture and an appropriate mapping algorithm. The proposed fully programmable decoder architectures use one-phase message passing and layered decoding and can decode arbitrary LDPC codes using an optimum mapping and scheduling algorithm. The presented programmable architectures are in fact generalized decoder architectures from which the known decoder architectures for structured LDPC codes can be derived.

Efficient arithmetic for high speed DSP implementation on FPGAs

Steven W. Alexander

The author was sponsored by EnTegra Ltd, a company that develops hardware and software products and services for the real time implementation of DSP and RF systems. The field programmable gate array (FPGA) is being used increasingly in the field of DSP. This is due to the fact that the parallel computing power of such devices is ideal for today’s truly demanding DSP algorithms. Algorithms such as the QR-RLS update are computationally intensive and must be carried out at extremely high speeds (MHz). This means that the DSP processor is simply not an option. ASICs can be used but the expense of developing custom logic is prohibitive. The increased use of the FPGA in DSP means that there is a significant requirement for efficient arithmetic cores that utilise the resources on such devices. This thesis presents the research and development effort that was carried out to produce fixed point division and square root cores for use in a new Electronic Design Automation (EDA) tool for EnTegra, which is targeted at FPGA implementation of DSP systems. Further to this, a new technique for predicting the accuracy of CORDIC systems computing vector magnitudes and cosines/sines is presented. This work allows the most efficient CORDIC design for a specified level of accuracy to be found quickly and easily without the need to run lengthy simulations, as was the case before. The CORDIC algorithm is a technique using mainly shifts and additions to compute many arithmetic functions and is thus ideal for FPGA implementation.
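
To make the "shifts and additions" remark concrete, here is a minimal Python sketch of CORDIC in vectoring mode computing a vector magnitude. It is a textbook formulation, not the cores or the accuracy-prediction technique developed in the thesis; the function name, the 16 iterations, and the 16 fractional bits are illustrative assumptions. The final gain correction is done in floating point for brevity, whereas hardware would typically use a constant multiplier.

```python
import math


def cordic_magnitude(x: int, y: int, iterations: int = 16, frac_bits: int = 16) -> float:
    """Estimate sqrt(x*x + y*y) with CORDIC vectoring mode.

    The loop uses only integer shifts, additions and subtractions.
    """
    # Fold the vector into the first quadrant (magnitude is unchanged)
    # and move to fixed point so the right shifts keep fractional precision.
    x = abs(x) << frac_bits
    y = abs(y) << frac_bits
    for i in range(iterations):
        # Micro-rotate towards the x-axis; the sign of y picks the direction.
        if y >= 0:
            x, y = x + (y >> i), y - (x >> i)
        else:
            x, y = x - (y >> i), y + (x >> i)
    # Each micro-rotation scales the vector by sqrt(1 + 2^(-2i)); undo that gain.
    gain = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(iterations))
    return x / gain / (1 << frac_bits)
```

With these settings, `cordic_magnitude(3, 4)` lands close to 5; the residual error depends on the iteration count and word length, which is exactly the trade-off the accuracy-prediction work in the abstract addresses.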

A Subspace Based Approach to the Design, Implementation and Validation of Algorithms for Active Vibration Isolation Control

Gerard Nijsse

Vibration isolation endeavors to reduce the transmission of vibration energy from one structure (the source) to another (the receiver), to prevent undesirable phenomena such as sound radiation. A well-known method for achieving this is passive vibration isolation (PVI). In PVI, mounts consisting of springs and dampers connect the vibrating source to the receiver. The stiffness of the mount determines the fundamental resonance frequency of the mounted system and vibrations with a frequency higher than the fundamental resonance frequency are attenuated. Unfortunately, however, other design requirements (such as static stability) often impose a minimum allowable stiffness, thus limiting the achievable vibration isolation by passive means. A more promising method for vibration isolation is hybrid vibration isolation control. This entails that, in addition to PVI, an active vibration isolation control (AVIC) system is used with sensors, actuators and a control system that compensates for vibrations in the lower frequency range. Here, the use of a special form of AVIC using statically determinate stiff mounts is proposed. The mounts establish a statically determinate system of high stiffness connections in the actuated directions and of low stiffness connections in the unactuated directions. The latter ensures PVI in the unactuated directions. This approach is called statically determinate AVIC (SD-AVIC). The aim of the control system is to produce antidisturbance forces that counteract the disturbance forces stemming from the source. Using this approach, the vibration energy transfer from the source to the receiver is blocked in the mount due to the anti-forces. This thesis deals with the design of controllers generating the anti-forces by applying techniques that are commonly used in the field of signal processing. The control approaches are model-based and cover both adaptive and fixed gain control, in both feedforward and feedback configurations.
The control approaches are validated using two experimental vibration isolation setups: a single reference single actuator single error sensor (SR-SISO) setup and a single reference input multiple actuator input multiple error sensor output (SR-MIMO) setup. Finding a plant model can be a problem. This is solved by using a black-box modelling strategy. The plants are identified using subspace model identification. It is shown that accurate linear models can be found in a straightforward manner by using small batches of recorded (sampled) time-domain data only. Based on the identified models, controllers are designed, implemented and validated. Due to resonance in mechanical structures, adaptive SD-AVIC systems are often hampered by slow convergence of the controller coefficients. In general, it is desirable that the SD-AVIC system yields fast optimum performance after it is switched on. To achieve this result and speed up the convergence of the adaptive controller coefficients, the so-called inverse outer factor model is included in the adaptive control scheme. The inner/outer factorization, that has to be performed to obtain the inverse outer factor model, is completely determined in state space to enable a numerically robust computation. The inverse outer factor model is also incorporated in the control scheme as a state space model. It is found that fast adaptation of the controller coefficients is possible. Controllers are designed, implemented and validated to suppress both narrowband and broadband disturbances. Scalar regularization is used to prevent actuator saturation and an unstable closed loop. In order to reduce the computational load of the controllers, several steps are taken including controller order reduction and implementation of lower order models. It is found that in all experiments the simulation and real-time results correspond closely for both the fixed gain and adaptive control situation. 
On the SR-SISO setup, reductions of up to 5.0 dB are achieved in real time when suppressing a broadband disturbance output (0-2 kHz) using feedback control. On the SR-MIMO vibration isolation setup, feedforward control achieves real-time reductions of 9.4 dB for broadband disturbances (0-1 kHz), and feedback control achieves real-time reductions of up to 3.5 dB (0-1 kHz). For the SR-MIMO setup, the reduction values are obtained by averaging the reductions obtained in all sensor outputs. The results pave the way for the next generation of algorithms for SD-AVIC.

Auditory System for a Mobile Robot

Jean-Marc Valin

The auditory system of living creatures provides useful information about the world, such as the location and interpretation of sound sources. For humans, it means being able to focus one's attention on events, such as a phone ringing, a vehicle honking, a person talking, etc. For those who do not suffer from hearing impairments, it is hard to imagine a day without being able to hear, especially in a very dynamic and unpredictable world. Mobile robots would also benefit greatly from having auditory capabilities. In this thesis, we propose an artificial auditory system that gives a robot the ability to locate and track sounds, as well as to separate simultaneous sound sources and recognise simultaneous speech. We demonstrate that it is possible to implement these capabilities using an array of microphones, without trying to imitate the human auditory system. The sound source localisation and tracking algorithm uses a steered beamformer to locate sources, which are then tracked using a multi-source particle filter. Separation of simultaneous sound sources is achieved using a variant of the Geometric Source Separation (GSS) algorithm, combined with a multisource post-filter that further reduces noise, interference and reverberation. Speech recognition is performed on separated sources, either directly or by using Missing Feature Theory (MFT) to estimate the reliability of the speech features. The results obtained show that it is possible to track up to four simultaneous sound sources, even in noisy and reverberant environments. Real-time control of the robot following a sound source is also demonstrated. The sound source separation approach we propose is able to achieve a 13.7 dB improvement in signal-to-noise ratio compared to a single microphone when three speakers are present.
In these conditions, the system demonstrates more than 80% accuracy on digit recognition, higher than most human listeners could obtain in our small case study when recognising only one of these sources. All these new capabilities will allow humans to interact more naturally with a mobile robot in real life settings.
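
The steered-beamformer idea mentioned in this abstract can be illustrated with a toy two-microphone, time-domain delay-and-sum search. This is a deliberately simplified sketch and not the thesis's algorithm, which works with a full microphone array; the function name and the restriction to integer sample delays are assumptions made for illustration.

```python
def best_steering_delay(mic_a, mic_b, max_delay):
    """Return the integer delay (in samples) of mic_b relative to mic_a
    that maximises the energy of the delay-and-sum output.

    Both signals must have the same length, greater than 2 * max_delay.
    """
    best_delay, best_energy = 0, float("-inf")
    for delay in range(-max_delay, max_delay + 1):
        energy = 0.0
        # Skip the edges so that n + delay always stays inside mic_b.
        for n in range(max_delay, len(mic_a) - max_delay):
            summed = mic_a[n] + mic_b[n + delay]
            energy += summed * summed
        if energy > best_energy:
            best_delay, best_energy = delay, energy
    return best_delay
```

When the candidate delay matches the true inter-microphone delay, the two copies of the source add coherently and the output energy peaks; with known microphone spacing and sound speed, that delay could then be mapped to a direction of arrival.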

Restoration of Nonlinearly Distorted Optical Soundtracks Using Regularized Inverse Characteristics

Tamas B. Bako

This dissertation is concerned with the possibilities of restoring degraded film sound. The sound quality of old films is often not acceptable, which means that the sound is so noisy and distorted that the listener has to make a strong effort to understand the conversations in the film. In this case the film cannot give artistic enjoyment to the listener. This is the reason that several old films cannot be presented in cinemas or on television. The quality of these films can be improved by digital restoration techniques. Since we have access only to the distorted signal and not to the original, we cannot adjust recording parameters or recording techniques. The only possibility is to post-compensate the signal to produce a better estimate of the undistorted, noiseless signal. In this dissertation new methods are proposed for fast and efficient restoration of nonlinear distortions in optically recorded film soundtracks. First the nonlinear models and nonlinear restoration techniques are surveyed and the ill-posedness of nonlinear post-compensation (the extreme sensitivity to noise) is explained. The effects and sources of linear and nonlinear distortions in optical soundtracks are also described. A new method is proposed to overcome the ill-posedness of the restoration problem and to get an optimal result. The effectiveness of the algorithm is demonstrated by simulations and by restoration of real film-sound signals.

Through-Wall Imaging with UWB Radar System

Ing. Michal Aftanas, PhD.

Motivation: Humans have been interested in discovering the unknown since the very beginning of human history. Our eyes help us to investigate our environment by means of reflected light. However, the wavelengths of visible light allow a transparent view through only very few kinds of materials. On the other hand, Ultra WideBand (UWB) electromagnetic waves with frequencies of a few gigahertz are able to penetrate almost all types of materials around us. With some sophisticated methods and a bit of luck we are able to investigate what is behind opaque walls. Rescue and security of people are among the most promising fields for such applications. Rescue: Imagine how useful information about the interior of a barricaded building with terrorists and hostages inside could be for policemen. The tactics of a police raid can be built on real-time information about the ground plan of the room and the positions of large objects inside. How useful could information about the current interior state of a room be for firemen before they get inside? Such a hazardous environment, full of smoke with zero visibility, is very dangerous, and each additional piece of information can make the difference between life and death. Security: Investigating objects through plastic, rubber, clothing or other nonmetallic materials could be highly useful as an additional tool alongside existing x-ray scanners. In particular, it could be used for scanning baggage at airports, truckloads at borders, dangerous boxes, etc.

Audio Time-Scale Modification

David Dorran

Audio time-scale modification is an audio effect that alters the duration of an audio signal without affecting its perceived local pitch and timbral characteristics. There are two broad categories of time-scale modification algorithms, time-domain and frequency-domain. The computationally efficient time-domain techniques produce high quality results for single pitched signals such as speech, but do not cope well with more complex signals such as polyphonic music. The less efficient frequency-domain techniques have proven to be more robust and produce high quality results for a variety of signals; however they introduce a reverberant artefact into the output. This dissertation focuses on incorporating aspects of time-domain techniques into frequency-domain techniques in an attempt to reduce the presence of the reverberant artefact and improve upon computational demands. From a review of prior work it was found that there are a number of time-domain algorithms available and that the choice of algorithm parameters varies considerably in the literature. This finding prompted an investigation into the effects of the choice of parameters and a comparison of the various techniques employed in terms of computational requirements and output quality. The investigation resulted in the derivation of an efficient and flexible parameter set for use within time-domain implementations. Of the available frequency-domain approaches the phase vocoder and time-domain/subband techniques offer an efficiency and robustness advantage over sinusoidal modelling and iterative phase update techniques, and as such were identified as suitable candidates for the provision of a framework for further investigation. Following from this observation, improvements in the quality produced by time-domain/subband techniques are realised through the use of a Bark-based subband partitioning approach and effective subband synchronisation techniques.
In addition, computational and output quality improvements within a phase vocoder implementation are achieved by taking advantage of a certain level of flexibility in the choice of phase within such an implementation. The phase flexibility established is used to push or pull phase values into a phase coherent state. Further improvements are realised by incorporating features of time-domain algorithms into the system in order to provide a ‘good’ initial set of phase estimates; the transition to ‘perfect’ phase coherence is significantly reduced through this scheme, thereby improving the overall output quality produced. The result is a robust and efficient time-scale modification algorithm which draws upon various aspects of a number of general approaches to time-scale modification.
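
For readers unfamiliar with the time-domain family discussed above, the following Python sketch shows plain overlap-add (OLA) time-scale modification, the simplest member of that family. It is a generic baseline and not the algorithm developed in this dissertation: it performs no waveform alignment (as synchronised variants do) and no phase processing, so it will produce audible artefacts on real audio. The function name, the Hann window, and the 50% synthesis overlap are illustrative assumptions.

```python
import math


def ola_time_stretch(signal, alpha, frame_len=256):
    """Stretch `signal` by roughly a factor `alpha` using plain overlap-add.

    alpha > 1 lengthens the signal and alpha < 1 shortens it.
    """
    hop_out = frame_len // 2                 # synthesis hop (50% overlap)
    hop_in = max(1, round(hop_out / alpha))  # analysis hop
    # Periodic Hann window; at 50% overlap its shifted copies sum to one.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_frames = max(0, (len(signal) - frame_len) // hop_in + 1)
    out_len = (n_frames - 1) * hop_out + frame_len if n_frames else 0
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        src, dst = k * hop_in, k * hop_out
        for n in range(frame_len):
            # Read frames at the analysis hop, write them at the synthesis hop.
            out[dst + n] += window[n] * signal[src + n]
            norm[dst + n] += window[n]
    # Divide by the accumulated window weight to flatten the edge frames.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]
```

Because frames are repositioned without regard to the waveform, periodic content is generally misaligned in the overlap regions; reducing that misalignment is what the synchronisation and phase-coherence techniques described in the abstract are for.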

Interaction with Sound and Pre-Recorded Music: Novel Interfaces and Use Patterns

Tue Haste Andersen

Computers are changing the way sound and recorded music are listened to and used. The use of computers to play back music makes it possible to change and adapt music to different usage situations in ways that were not possible with analog sound equipment. In this thesis, interaction with pre-recorded music is investigated using prototypes and user studies. First, different interfaces for browsing music on consumer or mobile devices were compared. It was found that the choice of input controller, mapping and auditory feedback influenced how the music was searched and how the interfaces were perceived. Search performance was not affected by the tested interfaces. Based on this study, several ideas for the future design of music browsing interfaces were proposed. Indications that search time depends linearly on distance to target were observed and examined in a related study where a movement time model for searching in a text document using scrolling was developed. Second, work practices of professional disc jockeys (DJs) were studied and a new design for digital DJing was proposed and tested. Strong indications were found that the use of beat information could reduce the DJ’s cognitive workload while maintaining flexibility during the musical performance. A system for automatic beat extraction was designed based on an evaluation of a number of perceptually important parameters extracted from audio signals. Finally, auditory feedback in pen-gesture interfaces was investigated through a series of informal and formal experiments. The experiments point to several general rules of auditory feedback in pen-gesture interfaces: a few simple functions are easy to achieve, gaining further performance and learning advantage is difficult, the gesture set and its computerized recognizer can be designed to minimize visual dependence, and positive emotional or aesthetic response can be achieved using musical auditory feedback.