## BLAS Comparison on FPGA, CPU and GPU

High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot product and Gaxpy or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data-sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.

## Biosignal processing challenges in emotion recognition for adaptive learning

User-centered computer based learning is an emerging field of interdisciplinary research. Research in diverse areas such as psychology, computer science, neuroscience and signal processing is making contributions to take this field to the next level. Learning systems built using contributions from these fields could be used in actual training and education instead of just laboratory proof-of-concept. One of the important advances in this research is the detection and assessment of the cognitive and emotional state of the learner using such systems. This capability moves development beyond the use of traditional user performance metrics to include system intelligence measures that are based on current theories in neuroscience. These advances are of paramount importance in the success and wide spread use of learning systems that are automated and intelligent. Emotion is considered an important aspect of how learning occurs, and yet estimating it and making adaptive adjustments are not part of most learning systems. In this research we focus on one specific aspect of constructing an adaptive and intelligent learning system, that is, estimation of the emotion of the learner as he/she is using the automated training system. The challenge starts with the definition of the emotion and the utility of it in human life. The next challenge is to measure the co-varying factors of the emotions in a non-invasive way, and find consistent features from these measures that are valid across wide population. In this research we use four physiological sensors that are non-invasive, and establish a methodology of utilizing the data from these sensors using different signal processing tools. A validated set of visual stimuli used worldwide in the research of emotion and attention, called International Affective Picture System (IAPS), is used. A dataset is collected from the sensors in an experiment designed to elicit emotions from these validated visual stimuli. We describe a novel wavelet method to calculate hemispheric asymmetry metric using electroencephalography data. This method is tested against typically used power spectral density method. We show overall improvement in accuracy in classifying specific emotions using the novel method. We also show distinctions between different discrete emotions from the autonomic nervous system activity using electrocardiography, electrodermal activity and pupil diameter changes. Findings from different features from these sensors are used to give guidelines to use each of the individual sensors in the adaptive learning environment.

## Gauss-Newton Based Learning for Fully Recurrent Neural Networks

The thesis discusses a novel off-line and on-line learning approach for Fully Recurrent Neural Networks (FRNNs). The most popular algorithm for training FRNNs, the Real Time Recurrent Learning (RTRL) algorithm, employs the gradient descent technique for finding the optimum weight vectors in the recurrent neural network. Within the framework of the research presented, a new off-line and on-line variation of RTRL is presented, that is based on the Gauss-Newton method. The method itself is an approximate Newton’s method tailored to the specific optimization problem, (non-linear least squares), which aims to speed up the process of FRNN training. The new approach stands as a robust and effective compromise between the original gradient-based RTRL (low computational complexity, slow convergence) and Newton-based variants of RTRL (high computational complexity, fast convergence). By gathering information over time in order to form Gauss-Newton search vectors, the new learning algorithm, GN-RTRL, is capable of converging faster to a better quality solution than the original algorithm. Experimental results reflect these qualities of GN-RTRL, as well as the fact that GN-RTRL may have in practice lower computational cost in comparison, again, to the original RTRL.

## Wavelet Denoising for TDR Dynamic Range Improvement

A technique is presented for removing large amounts of noise present in time-domain-reflectometry (TDR) waveforms to increase the dynamic range of TDR waveforms and TDR based s-parameter measurements.

## Bilinear Transformation Made Easy

A formula is derived and demonstrated that is capable of directly generating digital filter coefficients from an analog filter prototype using the bilinear transformation. This formula obviates the need for any algebraic manipulation of the analog prototype filter and is ideal for use in embedded systems that must take in any general analog filter specification and dynamically generate digital filter coefficients directly usable in difference equations.

## FUZZY LOGIC BASED CONVOLUTIONAL DECODER FOR USE IN MOBILE TELEPHONE SYSTEMS

Efficient convolutional coding and decoding algorithms are most crucial to successful operation of wireless communication systems in order to achieve high quality of service by reducing the overall bit error rate performance. A widely applied and well evaluated scheme for error correction purposes is well known as Viterbi algorithm [7]. Although the Viterbi algorithm has very good error correcting characteristics, computational effort required remains high. In this paper a novel approach is discussed introducing a convolutional decoder design based on fuzzy logic. A simplified version of this fuzzy based decoder is examined with respect to bit error rate (BER) performance. It can be shown that the fuzzy based convolutional decoder here proposed considerably reduces computational effort with only minor BER performance degradation when compared to the classical Viterbi approach.

## Method to Calculate the Inverse of a Complex Matrix using Real Matrix Inversion

This paper describes a simple method to calculate the invers of a complex matrix. The key element of the method is to use a matrix inversion, which is available and optimised for real numbers. Some actual libraries used for digital signal processing only provide highly optimised methods to calculate the inverse of a real matrix, whereas no solution for complex matrices are available, like in [1]. The presented algorithm is very easy to implement, while still much more efficient than for example the method presented in [2]. [1] Visual DSP++ 4.0 C/C++ Compiler and Library Manual for TigerSHARC Processors; Analog Devices; 2005. [2] W. Press, S.A. Teukolsky, W.T. Vetterling, B.R. Flannery; Numerical Recipes in C++, The art of scientific computing, Second Edition; p52 : “Complex Systems of Equations”;Cambridge University Press 2002.

## Fully Programmable LDPC Decoder Hardware Architectures

In recent years, the amount of digital data which is stored and transmitted for private and public usage has increased considerably. To allow a save transmission and storage of data despite of error-prone transmission media, error correcting codes are used. A large variety of codes has been developed, and in the past decade low-density parity-check (LDPC) codes which have an excellent error correction performance became more and more popular. Today, low-density parity-check codes have been adopted for several standards, and eﬃcient decoder hardware architectures are known for the chosen structured codes. However, the existing decoder designs lack ﬂexibility as only few structured codes can be decoded with one decoder chip. In consequence, diﬀerent codes require a redesign of the decoder, and few solutions exist for decoding of codes which are not quasi-cyclic or which are unstructured. In this thesis, three diﬀerent approaches are presented for the implementation of fully programmable LDPC decoders which can decode arbitrary LDPC codes. As a design study, the ﬁrst programmable decoder which uses a heuristic mapping algorithm is realized on an ﬁeld-programmable gate array (FPGA), and error correction curves are measured to verify the correct functionality. The main contribution of this thesis lies in the development of the second and the third architecture and an appropriate mapping algorithm. The proposed fully programmable decoder architectures use one-phase message passing and layered decoding and can decode arbitrary LDPC codes using an optimum mapping and scheduling algorithm. The presented programmable architectures are in fact generalized decoder architectures from which the known decoders architectures for structured LDPC codes can be derived.

## Design of a Scalable Polyphony-MIDI Synthesizer for a Low Cost DSP

In this thesis, the design of a music synthesizer implementing the Scalable Polyphony-MIDI soundset on a low cost DSP system is presented. First, the SP-MIDI standard and the target DSP platform are presented followed by review of commonly used synthesis techniques and their applicability to systems with limited computational and memory resources. Next, various oscillator and ﬁlter algorithms used in digital subtractive synthesis are reviewed in detail. Special attention is given to the aliasing problem caused by discontinuities in classical waveforms, such as sawtooth and pulse waves and existing methods for bandlimited waveform synthesis are presented. This is followed by review of established structures for computationally efﬁcient time-varying ﬁlters. A novel digital structure is presented that decouples the cutoff and resonance controls. The new structure is based on the analog Korg MS-20 lowpass ﬁlter and is computationally very efﬁcient and well suited for implementation on low bitdepth architectures. Finally, implementation issues are discussed with emphasis on the Differentiated Parabole Wave oscillator and MS-20 ﬁlter structures and the effects of limited computational capability and low bitdepth. This is followed by designs for several example instruments.

## Implementation of a Tx/Rx OFDM System in a FPGA

The aim of this project consists in the FPGA design and implementation of a transmitter and receiver (Tx/Rx) multicarrier system such the Orthogonal Frequency Division Multiplexing (OFDM). This Tx/Rx OFDM subsystem is capable to deal with with different M-QAM modulations and is implemented in a digital signal processor (DSP-FPGA). The implementation of the Tx/Rx subsystem has been carried out in a FPGA using both System Generator visual programming running over Matlab/Simulink, and the Xilinx ISE program which uses VHDL language. This project is divided into four chapters, each one with a concrete objective. The first chapter is a brief introduction to the digital signal processor used, a field-programmable gate array (FPGA), and to the VHDL programming language. The second chapter is an overview on OFDM, its main advantages and disadvantages in front of previous systems, and a brief description of the different blocks composing the OFDM system. Chapter three provides the implementation details for each of these blocks, and also there is a brief explanation on the theory behind each of the OFDM blocks to provide a better comprehension on its implementation. The fourth chapter is focused, on the one hand, in showing the results of the Matlab/Simulink simulations for the different simulation schemes used and, on the other hand, to show the experimental results obtained using the FPGA to generate the OFDM signal at baseband and then upconverted at the frequency of 3,5 GHz. Finally the conclusions regarding the whole Tx/Rx design and implementation of the OFDM subsystem are given.

## Adaptive distributed noise reduction for speech enhancement in wireless acoustic sensor networks

An adaptive distributed noise reduction algorithm for speech enhancement is considered, which operates in a wireless acoustic sensor network where each node collects multiple microphone signals. In previous work, it was shown theoretically that for a stationary scenario, the algorithm provides the same signal estimators as the centralized multi-channel Wiener filter, while significantly compressing the data that is transmitted between the nodes. Here, we present simulation results of a fully adaptive implementation of the algorithm, in a non-stationary acoustic scenario with a moving speaker and two babble noise sources. The algorithm is implemented using a weighted overlap-add technique to reduce the overall input-output delay. It is demonstrated that good results can be obtained by estimating the required signal statistics with a long-term forgetting factor without downdating, even though the signal statistics change along with the iterative filter updates. It is also demonstrated that simultaneous node updating provides a significantly smoother and faster tracking performance compared to sequential node updating.

## BLAS Comparison on FPGA, CPU and GPU

High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot product and Gaxpy or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data-sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.

## FUZZY LOGIC BASED CONVOLUTIONAL DECODER FOR USE IN MOBILE TELEPHONE SYSTEMS

Efficient convolutional coding and decoding algorithms are most crucial to successful operation of wireless communication systems in order to achieve high quality of service by reducing the overall bit error rate performance. A widely applied and well evaluated scheme for error correction purposes is well known as Viterbi algorithm [7]. Although the Viterbi algorithm has very good error correcting characteristics, computational effort required remains high. In this paper a novel approach is discussed introducing a convolutional decoder design based on fuzzy logic. A simplified version of this fuzzy based decoder is examined with respect to bit error rate (BER) performance. It can be shown that the fuzzy based convolutional decoder here proposed considerably reduces computational effort with only minor BER performance degradation when compared to the classical Viterbi approach.

## Algorithms and tools for automatic generation of DSP hardware structures

The increased complexity of Digital Signal Processing (DSP) algorithms demands for the development of more complex and more eﬃcient hardware structures. The work presented herein describes the core components for the development of a tool capable of automatic generation of eﬃcient hardware structures, therefore facilitating developers work. It comprises algorithms and techniques for i) balancing the paths in a graph, ii) scheduling of operations to functional units, iii) allocating registers and iv) generating the VHDL code. Results show that the developed techniques are capable of generating the hardware structure of typical DSP algorithms represented in data-ﬂow graphs with over 2,000 nodes in around 200 ms, scaling to 80,000 nodes in about 214 s. Within the developed techniques, solving the scheduling problem is one of the most complex tasks: it is a NP-complete problem and directly inﬂuences the number of functional units and registers required. Therefore, experimental analysis was made on scheduling algorithms for time-constrained problems. Results show that simple list-based algorithms are more eﬃcient in large problems than more complex algorithms: they run faster and tend to require less functional units.

## Music Signal Processing

Chapter 12 of the book "Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications" - Musical Instruments - A Review of Basic Physics of Sound - Music Signal Features and Models - Ear: Hearing of Sounds - Psychoacoustics of Hearing - Music Compression - High Quality Music Coding: MPEG - Stereo Music - Music Recognition

## Evaluation of a Floating Point Acoustic Echo Canceller Implementation

This master thesis consists of implementation and evaluation of an AEC, Acoustic Echo Canceller, algorithm in a floating-point architecture. The most important question this thesis will try to answer is to determine benefits or drawbacks of using a floating-point architecture, relative a fixed-point architecture, to do AEC. In a telephony system there is two common forms of echo, line echo and acoustic echo. Acoustic echo is introduced by sound emanating from a loudspeaker, e.g. in a handsfree or speakerphone, being picked up by a microphone and then sent back to the source. The problem with this feedback is that the far-end speaker will hear one, or multiple, time-delayed version(s) of her own speech. This time-delayed version of speech is usually perceived as both confusing and annoying unless removed by the use of AEC. In this master thesis the performance of a floating-point version of a normalized least-mean-square AEC algorithm was evaluated in an environment designed and implemented to approximate live telephony calls. An instruction-set simulator and assembler available at the initiation of this master thesis were extended to enable; zero-overhead loops, modular addressing, post-increment of registers and register-write forwarding. With these improvements a bit-true assembly version was implemented capable of real-time AEC requiring 15 million instructions per second. A solution using as few as eight mantissa bits, in an external format used when storing data in memory, was found to have an insignificant effect on the selected AEC implementation’s performance. Due to the relatively low memory requirement of the selected AEC algorithm, the use of a small external format has a minor effect on the required memory size. In total this indicates that the possible reduction of the memory requirement and related energy consumption, does not justify the added complexity and energy consumption of using a floating-point architecture for the selected algorithm. Use of a floating-point format can still be advantageous in speech-related signal processing when the introduced time delay by a subband, or a similar frequency domain, solution is unacceptable. Speech algorithms that have high memory use and small introduced delay requirements are a good candidate for a floating-point digital signal processor architecture.

## Decoding Ogg Vorbis Audio with The C6416 DSP, using a custom made MDCT core on FPGA

Ogg Vorbis is a fairly new and growing audio format, often used for online distribution of music and internet radio stations for streaming audio. It is considered to be better than MP3 in both quality and compression and in the same league as for example AAC. In contrast with many other formats, like MP3 and AAC, Ogg Vorbis is patent and royalty free. The purpose of this thesis project was to investigate how the C6416 DSP processor and a Stratix II FPGA could be connected to each other and work together as co-processors and using an Ogg Vorbis decoder as implementation example. A fixed-point decoder called Tremor (developed by Xiph.Org the creator of the Vorbis I specification), has been ported to the DSP processor and an Ogg Vorbis player has been developed. Tremor was profiled before performing the software / hardware partitioning to decide what parts of the source code of Tremor that should be implemented in the FPGA to off-load and accelerate the DSP.

## Optimization of Audio Processing algorithms (Reverb) on ARMv6 family of processors

Audio processing algorithms are increasingly used in cell phones and today’s customers are placing more demands on cell phones. Feature phones, once the advent of mobile phone technology, nowadays do more than just providing the user with MP3 play back or advanced audio effects. These features have become an integral part of medium as well as low-end phones. On the other hand, there is also an endeavor to include as improved quality as possible into products to compete in market and satisfy users’ needs. Tackling the above requirements has been partly satisfied by the advance in hardware design and manufacturing technology. However, as new hardware emerges into market the need for competence to write efficient software and exploit the new features thoroughly and effectively arises. Even though compilers are also keeping up with the new tide space for hand optimized code still exist. Wrapped in the above goal, an effort was made in this thesis to partly cover the competence requirement at Multimedia Section (part of Ericsson Mobile Platforms) to develope optimized code for new processors. Forging persistently ahead with new products, EMP has always incorporated the latest technology into its products among which ARMv6 family of processors has the main central processing role in a number of upcoming products. To fully exploit latest features provided by ARMv6, it was required to probe its new instruction set among which new media processing instructions are of outmost importance. In order to execute DSP-intensive algorithms (e.g. Audio Processing algorithms) efficiently, the implementation should be done in low-level code applying available instruction set. Meanwhile, ARMv6 comes with a number of new features in comparison with its predecessors. SIMD (Single Instruction Multiple Data) and VFP (Vector Floating Point) are the most prominent media processing improvements in ARMv6. Aligned with thesis goals and guidelines, Reverb algorithm which is among one of the most complicated audio features on a hand-held devices was probed. Consequently, its kernel parts were identified and implementation was done both in fixed-point and floating-point using the available resources on hardware. Besides execution time and amount of code memory for each part were measured and provided in tables and charts for comparison purposes. Conclusions were finally drawn based on developed code’s efficiency over ARM compiler’s as well as existing code already developed and tailored to ARMv5 processors. The main criteria for optimization was the execution time. Moreover, quantization effect due to limited precision fixed-point arithmetic was formulated and its effect on quality was elaborated. The outcomes, clearly indicate that hand optimization of kernel parts are superior to Compiler optimized alternative both from the point of code memory as well as execution time. The results also confirmed the presumption that hand optimized code using new instruction set can improve efficiency by an average 25%-50% depending on the algorithm structure and its interaction with other parts of audio effect. Despite its many draw backs, fixed-point implementation remains yet to be the dominant implementation for majority of DSP algorithms on low-power devices.

## Efficient arithmetic for high speed DSP implementation on FPGAs

The author was sponsored by EnTegra Ltd, a company who develop hardware and software products and services for the real time implementation of DSP and RF systems. The field programmable gate array (FPGA) is being used increasingly in the field of DSP. This is due to the fact that the parallel computing power of such devices is ideal for today’s truly demanding DSP algorithms. Algorithms such as the QR-RLS update are computationally intensive and must be carried out at extremely high speeds (MHz). This means that the DSP processor is simply not an option. ASICs can be used but the expense of developing custom logic is prohibitive. The increased use of the FPGA in DSP means that there is a significant requirement for efficient arithmetic cores that utilises the resources on such devices. This thesis presents the research and development effort that was carried out to produce fixed point division and square root cores for use in a new Electronic Design Automation (EDA) tool for EnTegra, which is targeted at FPGA implementation of DSP systems. Further to this, a new technique for predicting the accuracy of CORDIC systems computing vector magnitudes and cosines/sines is presented. This work allows the most efficient CORDIC design for a specified level of accuracy to be found quickly and easily without the need to run lengthy simulations, as was the case before. The CORDIC algorithm is a technique using mainly shifts and additions to compute many arithmetic functions and is thus ideal for FPGA implementation.

## Audio Time-Scale Modification

Audio time-scale modification is an audio effect that alters the duration of an audio signal without affecting its perceived local pitch and timbral characteristics. There are two broad categories of time-scale modification algorithms, time-domain and frequency-domain. The computationally efficient time-domain techniques produce high quality results for single pitched signals such as speech, but do not cope well with more complex signals such as polyphonic music. The less efficient frequencydomain techniques have proven to be more robust and produce high quality results for a variety of signals; however they introduce a reverberant artefact into the output. This dissertation focuses on incorporating aspects of time-domain techniques into frequency-domain techniques in an attempt to reduce the presence of the reverberant artefact and improve upon computational demands. From a review of prior work it was found that there are a number of time-domain algorithms available and that the choice of algorithm parameters varies considerably in the literature. This finding prompted an investigation into the effects of the choice of parameters and a comparison of the various techniques employed in terms of computational requirements and output quality. The investigation resulted in the derivation of an efficient and flexible parameter set for use within time-domain implementations. Of the available frequency-domain approaches the phase vocoder and timedomain/ subband techniques offer an efficiency and robustness advantage over sinusoidal modelling and iterative phase update techniques, and as such were identified as suitable candidates for the provision of a framework for further investigation. Following from this observation, improvements in the quality produced by time-domain/subband techniques are realised through the use of a bark based subband partitioning approach and effective subband synchronisation techniques. In addition, computational and output quality improvements within a phase vocoder implementation are achieved by taking advantage of a certain level of flexibility in the choice of phase within such an implementation. The phase flexibility established is used to push or pull phase values into a phase coherent state. Further improvements are realised by incorporating features of time-domain algorithms into the system in order to provide a ‘good’ initial set of phase estimates; the transition to ‘perfect’ phase coherence is significantly reduced through this scheme, thereby improving the overall output quality produced. The result is a robust and efficient time-scale modification algorithm which draws upon various aspects of a number of general approaches to time-scale modification.