## BLAS Comparison on FPGA, CPU and GPU

High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot product and Gaxpy or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data-sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.
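For readers unfamiliar with the two kernels named above, the following is a minimal reference sketch of a dot product and a Gaxpy (generalized `A x + y`) in plain Python. It only illustrates what the operations compute; it says nothing about the parameterized FPGA designs or the accumulator described in the paper, and the function names `dot` and `gaxpy` are chosen here for illustration.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def gaxpy(a, x, y):
    """Return A x + y, where A is a list of rows (row-major matrix)."""
    if len(a) != len(y):
        raise ValueError("A must have one row per element of y")
    # Each output element is one dot product (a reduction) plus an offset.
    return [dot(row, x) + yi for row, yi in zip(a, y)]


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    print(gaxpy(A, [1.0, 1.0], [10.0, 20.0]))
```

Each row of the result is a reduction over a full row of `A`, which is why the speed of the floating-point accumulation matters for matrices of any aspect ratio.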

## Wavelet Denoising for TDR Dynamic Range Improvement

A technique is presented for removing large amounts of noise from time-domain reflectometry (TDR) waveforms, increasing the dynamic range of both the waveforms and TDR-based S-parameter measurements.

## Bilinear Transformation Made Easy

A formula is derived and demonstrated that is capable of directly generating digital filter coefficients from an analog filter prototype using the bilinear transformation. This formula obviates the need for any algebraic manipulation of the analog prototype filter and is ideal for use in embedded systems that must take in any general analog filter specification and dynamically generate digital filter coefficients directly usable in difference equations.
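As a rough illustration of the task being solved, the sketch below applies the textbook substitution s = 2·fs·(1 − z⁻¹)/(1 + z⁻¹) to an analog transfer function by expanding polynomials numerically. It is not the closed-form formula derived in the article; the function names, the ascending-power coefficient ordering and the absence of frequency pre-warping are all assumptions made for this example.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as ascending-power coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def poly_pow(p, n):
    """Raise a polynomial to a non-negative integer power."""
    out = [1.0]
    for _ in range(n):
        out = poly_mul(out, p)
    return out


def bilinear(b, a, fs):
    """Map H(s) = sum(b[k] s^k) / sum(a[k] s^k) to H(z).

    Returns (bz, az) in ascending powers of z^-1 with az[0] == 1,
    directly usable in a difference equation.
    Assumes the leading digital denominator coefficient is non-zero.
    """
    order = max(len(b), len(a)) - 1
    c = 2.0 * fs

    def transform(coeffs):
        out = [0.0] * (order + 1)
        for k, ck in enumerate(coeffs):
            # s^k becomes c^k (1 - z^-1)^k (1 + z^-1)^(order - k)
            term = poly_mul(poly_pow([1.0, -1.0], k),
                            poly_pow([1.0, 1.0], order - k))
            for i, t in enumerate(term):
                out[i] += ck * c ** k * t
        return out

    bz, az = transform(b), transform(a)
    return [v / az[0] for v in bz], [v / az[0] for v in az]


if __name__ == "__main__":
    # First-order lowpass H(s) = 2 / (s + 2), sampled at fs = 1 Hz.
    print(bilinear([2.0], [2.0, 1.0], fs=1.0))
```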

## Fuzzy Logic Based Convolutional Decoder for Use in Mobile Telephone Systems

Efficient convolutional coding and decoding algorithms are crucial to the successful operation of wireless communication systems, which achieve high quality of service by reducing the overall bit error rate. A widely applied and well-evaluated error-correction scheme is the Viterbi algorithm [7]. Although the Viterbi algorithm has very good error-correcting characteristics, the computational effort it requires remains high. In this paper, a novel approach is discussed that introduces a convolutional decoder design based on fuzzy logic. A simplified version of this fuzzy-based decoder is examined with respect to bit error rate (BER) performance. It can be shown that the proposed fuzzy-based convolutional decoder considerably reduces computational effort with only minor BER performance degradation compared to the classical Viterbi approach.

## Method to Calculate the Inverse of a Complex Matrix using Real Matrix Inversion

This paper describes a simple method to calculate the inverse of a complex matrix. The key element of the method is to use a matrix inversion routine that is available and optimised for real numbers. Some current libraries used for digital signal processing, such as [1], provide only highly optimised methods to calculate the inverse of a real matrix, with no solution available for complex matrices. The presented algorithm is very easy to implement while still being much more efficient than, for example, the method presented in [2]. [1] Visual DSP++ 4.0 C/C++ Compiler and Library Manual for TigerSHARC Processors; Analog Devices; 2005. [2] W. Press, S.A. Teukolsky, W.T. Vetterling, B.R. Flannery; Numerical Recipes in C++, The Art of Scientific Computing, Second Edition; p. 52: “Complex Systems of Equations”; Cambridge University Press, 2002.
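To make the general idea concrete, here is a minimal sketch of one well-known way to invert a complex matrix with a real-only inversion routine: embed C = A + iB in the real block matrix [[A, −B], [B, A]], invert that, and read the result back as X + iY. This is not necessarily the algorithm of the paper (which may be more efficient), and `real_inverse` is a simple Gauss-Jordan stand-in for whatever optimised real routine a DSP library provides.

```python
def real_inverse(m):
    """Gauss-Jordan inverse with partial pivoting for a real square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def complex_inverse(c):
    """Invert a complex n x n matrix using only a real 2n x 2n inversion."""
    n = len(c)
    # Real embedding [[A, -B], [B, A]] of C = A + iB.
    top = [[c[i][j].real for j in range(n)] + [-c[i][j].imag for j in range(n)]
           for i in range(n)]
    bottom = [[c[i][j].imag for j in range(n)] + [c[i][j].real for j in range(n)]
              for i in range(n)]
    inv = real_inverse(top + bottom)
    # The inverse has the same block form [[X, -Y], [Y, X]], with C^-1 = X + iY.
    return [[complex(inv[i][j], inv[n + i][j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    C = [[1 + 1j, 2 + 0j], [0j, 1j]]
    for row in complex_inverse(C):
        print(row)
```

The embedding doubles the matrix dimension, so this sketch trades arithmetic for simplicity; it is shown only because it needs nothing beyond a real inverse.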

## Real Time Implementation of Multi-Level Perfect Signal Reconstruction Filter Bank

The Discrete Wavelet Transform (DWT) is an efficient tool for signal and image processing applications and has been used for perfect signal reconstruction. In this paper, twenty-seven optimum combinations of three different wavelet filter types, three different filter reconstruction levels and three different kinds of signal were implemented in MATLAB/Simulink for a multi-level perfect reconstruction filter bank. All the filters for the different wavelet types were designed using the Filter Design Analysis (FDA) and Wavelet toolboxes. The signal-to-noise ratio (SNR) was calculated for each combination. The combination with the best SNR was then implemented on a TMS320C6713 DSP kit. Real-time testing of perfect reconstruction on the DSP kit was then carried out by two different methods. Experimental results agree with theory and simulations.
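The notion of multi-level perfect reconstruction and its SNR check can be sketched with the simplest orthogonal wavelet. The example below uses the Haar filter pair in plain Python as an assumption for illustration; the paper's actual wavelet types, levels and test signals are not specified here, and the signal length must be divisible by 2 raised to the number of levels.

```python
import math

S = 1.0 / math.sqrt(2.0)


def haar_analysis(x):
    """One analysis stage: lowpass/highpass filtering plus downsampling by 2."""
    approx = [(x[i] + x[i + 1]) * S for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * S for i in range(0, len(x), 2)]
    return approx, detail


def haar_synthesis(approx, detail):
    """One synthesis stage: the exact inverse of haar_analysis."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) * S, (a - d) * S])
    return x


def decompose(x, levels):
    """Multi-level analysis; returns the final approximation and all details."""
    details = []
    for _ in range(levels):
        x, d = haar_analysis(x)
        details.append(d)
    return x, details


def reconstruct(approx, details):
    """Multi-level synthesis, undoing decompose() level by level."""
    for d in reversed(details):
        approx = haar_synthesis(approx, d)
    return approx


def snr_db(ref, est):
    """SNR of a reconstruction relative to the reference signal, in dB."""
    signal = sum(r * r for r in ref)
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    x = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
    approx, details = decompose(x, levels=3)
    print(snr_db(x, reconstruct(approx, details)))
```

In floating point the reconstruction error is limited only by rounding, so the SNR is very large; on fixed-point or real-time hardware it would be bounded by quantization instead.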

## Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor

The Division of Computer Engineering at Linköping University is currently researching the possibility of creating a highly parallel DSP platform that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA, and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor acts as the main processor for general tasks and execution control, while the co-processors accelerate computing-intensive and parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse that run on a single co-processor. The kernels are then evaluated to find possible problems with the co-processors' microarchitecture and to suggest solutions to any problems found. The evaluation shows that the performance potential is very good, but a few problems have been identified that cause significant overhead in the kernels. Pipeline mismatches, which occur because different instructions have different pipeline lengths, cause pipeline hazards, and the current solution to this does not allow effective use of the pipeline. In some cases, the single-port memories cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slowly.

## Correlation and Power Spectrum

In the signals and systems course and in the first course in digital signal processing, a signal is most often characterized by its amplitude spectrum in the frequency domain and its amplitude profile in the time domain. Students become so used to this type of characterization that, in the ensuing statistical signal processing course, they find it difficult to appreciate that a signal can also be characterized by its autocorrelation function in the time domain and the corresponding power spectrum in the frequency domain, and that the amplitude characterization is not available. In this article, the characterization of a signal by its autocorrelation function in the time domain and the corresponding power spectrum in the frequency domain is described. Cross-correlation of two signals is also presented.
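The link between the two characterizations can be checked numerically. The sketch below, which assumes a short deterministic real sequence and circular (periodic) correlation rather than the statistical estimates discussed in the article, verifies that the DFT of the autocorrelation equals the power spectrum |X[k]|². The function names and the unnormalized scaling are choices made for this example.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_autocorrelation(x):
    """r[m] = sum_t x[t] x[(t + m) mod N] for a real sequence x."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def power_spectrum(x):
    """Unnormalized power spectrum |X[k]|^2."""
    return [abs(v) ** 2 for v in dft(x)]


if __name__ == "__main__":
    x = [1.0, 2.0, 0.0, -1.0]
    r = circular_autocorrelation(x)
    print(r)                      # r[0] is the signal energy
    print([round(v.real, 9) for v in dft(r)])
    print([round(v, 9) for v in power_spectrum(x)])
```

Note that the phase of X[k] does not appear in either quantity, which is the sense in which the amplitude characterization is lost.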

## Digital Signal Processing Maths

Modern digital signal processing makes use of a variety of mathematical techniques. These techniques are used to design and understand efficient filters for data processing and control.

## Auditory Component Analysis Using Perceptual Pattern Recognition to Identify and Extract Independent Components From an Auditory Scene

The cocktail party effect, our ability to separate a sound source from a multitude of other sources, has been researched in detail over the past few decades, and many investigators have tried to model this on computers. Two of the major research areas currently being evaluated for the so-called sound source separation problem are Auditory Scene Analysis (Bregman 1990) and a class of statistical analysis techniques known as Independent Component Analysis (Hyvärinen 2001). This paper presents a methodology for combining these two techniques. It suggests a framework that first separates sounds by analyzing the incoming audio for patterns and synthesizing or filtering them accordingly, measures features of the resulting tracks, and finally separates sounds statistically by matching feature sets and making the output streams statistically independent. Artificial and acoustical mixes of sounds are used to evaluate the signal-to-noise ratio where the signal is the desired source and the noise is comprised of all other sources. The proposed system is found to successfully separate audio streams. The amount of separation is inversely proportional to the amount of reverberation present.

## A DSP Implementation of OFDM Acoustic Modem

The success of multicarrier modulation in the form of OFDM in radio channels illuminates a path one could take towards high-rate underwater acoustic communications, and underwater OFDM has recently been investigated intensively. In this paper, we implement the acoustic OFDM transmitter and receiver design of [4, 5] on a TMS320C6713 DSP board. We analyze the workload and identify the most time-consuming operations. Based on the workload analysis, we tune the algorithms and optimize the code to substantially reduce the synchronization time to 0.2 seconds and the processing time of one OFDM block to 1.7 seconds on a DSP processor at 225 MHz. This experimentation provides guidelines for our future work to reduce the per-block processing time to less than the block duration of 0.23 seconds for real-time operation.

## Teaching MODEM Concepts and Design Procedure with MATLAB Simulations

MATLAB simulation is used as the primary tool to illustrate concepts, to validate MODEM designs, and to verify operation of the subsystems employed in DSP-based transmitters and receivers presented in a pair of classes on MODEM Design and Digital Receiver Design. The whole gamut of subsystems found in conventional and experimental modem designs is simulated and assembled to form a full end-to-end simulation of an operating MODEM. This paper describes the philosophy used to guide class involvement and assesses the experience and the learning value to student participants.

## Reduced-Delay IIR Filters

This document describes a straightforward method to significantly reduce the number of necessary multiplies per input sample of traditional IIR lowpass and highpass digital filters.

## An Experimental Multichannel Pulse Code Modulation System of Toll Quality + Electron Beam Deflection Tube For Pulse Code Modulation

Pulse Code Modulation offers attractive possibilities for multiplex telephony via such media as the microwave radio relay. The various problems involved in its use have been explored in terms of a 96-channel system designed to meet the transmission requirements commonly imposed upon commercial toll circuits. Twenty-four of the 96 channels have been fully equipped in an experimental model of the system. Coding and decoding devices are described, along with other circuit details. The coder is based upon a new electron beam tube, and is characterized by speed and simplicity as well as accuracy of coding. These qualities are matched in the decoder, which employs pulse excitation of a simple reactive network.

## Algorithms, Architectures, and Applications for Compressive Video Sensing

The design of conventional sensors is based primarily on the Shannon-Nyquist sampling theorem, which states that a signal of bandwidth W Hz is fully determined by its discrete-time samples provided the sampling rate exceeds 2W samples per second. For discrete-time signals, the Shannon-Nyquist theorem has a very simple interpretation: the number of data samples must be at least as large as the dimensionality of the signal being sampled and recovered. This important result enables signal processing in the discrete-time domain without any loss of information. However, in an increasing number of applications, the Shannon-Nyquist sampling theorem dictates an unnecessary and often prohibitively high sampling rate. (See Box 1 for a derivation of the Nyquist rate of a time-varying scene.) As a motivating example, the high resolution of the image sensor hardware in modern cameras reflects the large amount of data sensed to capture an image. A 10-megapixel camera, in effect, takes 10 million measurements of the scene. Yet, almost immediately after acquisition, redundancies in the image are exploited to compress the acquired data significantly, often at compression ratios of 100:1 for visualization and even higher for detection and classification tasks. This example suggests immense wastage in the overall design of conventional cameras.

## Using the DFT as a Filter: Correcting a Misconception

I have read, in some of the literature of DSP, that when the discrete Fourier transform (DFT) is used as a filter the process of performing a DFT causes an input signal's spectrum to be frequency translated down to zero Hz (DC). I can understand why someone might say that, but I challenge that statement as being incorrect. Here are my thoughts.

## An Introduction To Compressive Sampling

This article surveys the theory of compressive sensing, also known as compressed sensing or CS, a novel sensing/sampling paradigm that goes against the common wisdom in data acquisition.

## Hilbert Transform and Applications

Section 1 reviews the mathematical definition of the Hilbert transform and various ways to calculate it.

Sections 2 and 3 review applications of the Hilbert transform in two major areas: signal processing and system identification.

Section 4 concludes with remarks on the historical development of the Hilbert transform.
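As one example of a way to calculate the transform, the sketch below uses the common frequency-domain approach for sampled data: take the DFT, zero the negative-frequency bins, double the positive ones, and invert to obtain the analytic signal, whose imaginary part is the discrete Hilbert transform. This is a generic illustration and not taken from the document's Section 1; the direct O(N²) DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(N^2) DFT, or inverse DFT when inverse=True."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def analytic_signal(x):
    """Return x + j*H{x} for a real sequence x."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            pass                      # DC and Nyquist bins are left unchanged
        elif k < (n + 1) // 2:
            spectrum[k] *= 2          # positive frequencies are doubled
        else:
            spectrum[k] = 0           # negative frequencies are removed
    return dft(spectrum, inverse=True)


def hilbert_transform(x):
    """Discrete Hilbert transform: imaginary part of the analytic signal."""
    return [v.imag for v in analytic_signal(x)]


if __name__ == "__main__":
    n = 16
    x = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
    # For a cosine with an integer number of cycles, the result is the sine.
    print([round(v, 6) for v in hilbert_transform(x)])
```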

## Novel Method of Showing Frequency Transients in the Fourier Transform and Its Application in Time-Frequency Analysis

The Fourier transform in the frequency domain is modified to also analyse frequency transients, i.e. changes in the frequency spectrum with the time variable, of any order. Analytically, this is a very useful tool, as there are many problems where frequency variation with time has to be analyzed, e.g. Doppler shift or light passing through different media in time and space. Numerical calculations are usually done for such problems when needed. Here, the Fourier transform is extended with additional variables so that, by changing the Fourier operator, it simultaneously performs Time lag-Frequency Analysis (TLFA). The Frequency Derivative Analysis (FDA) of any order can also be obtained from the Fourier transform. The validity of the operator is examined using eigenvalue analysis and operator algebra.

## Adaptive distributed noise reduction for speech enhancement in wireless acoustic sensor networks

An adaptive distributed noise reduction algorithm for speech enhancement is considered, which operates in a wireless acoustic sensor network where each node collects multiple microphone signals. In previous work, it was shown theoretically that for a stationary scenario, the algorithm provides the same signal estimators as the centralized multi-channel Wiener filter, while significantly compressing the data that is transmitted between the nodes. Here, we present simulation results of a fully adaptive implementation of the algorithm, in a non-stationary acoustic scenario with a moving speaker and two babble noise sources. The algorithm is implemented using a weighted overlap-add technique to reduce the overall input-output delay. It is demonstrated that good results can be obtained by estimating the required signal statistics with a long-term forgetting factor without downdating, even though the signal statistics change along with the iterative filter updates. It is also demonstrated that simultaneous node updating provides a significantly smoother and faster tracking performance compared to sequential node updating.
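The phrase "forgetting factor without downdating" can be illustrated with a generic exponentially weighted estimate of a correlation matrix, in which old data fades gradually and is never explicitly subtracted. The recursion below, including the (1 − λ) normalization and the name `update_correlation`, is an assumption made for illustration and is not claimed to be the exact estimator used in the paper.

```python
def update_correlation(r, x, lam):
    """One recursive update R <- lam * R + (1 - lam) * x x^H.

    r   : current n x n estimate (list of lists)
    x   : new multichannel sample, length n (real or complex)
    lam : forgetting factor in [0, 1); values close to 1 give long memory
    """
    n = len(x)
    return [[lam * r[i][j] + (1.0 - lam) * x[i] * x[j].conjugate()
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    R = [[0.0, 0.0], [0.0, 0.0]]
    for sample in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
        R = update_correlation(R, sample, lam=0.5)
    print(R)
```

With a long-term forgetting factor, older samples (including those gathered before the most recent filter update) still contribute to the estimate, which is the situation the abstract refers to.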