Implementation of Algorithms on FPGAs

Mattias Karlsson

This thesis describes how an algorithm is moved from a digital signal processor to an embedded microprocessor in an FPGA using Altera's C-to-hardware tool. Saab Avitronics develops the secondary high-lift control system for the Boeing 787 aircraft. The high-lift system consists of electric motors controlling the trailing-edge wing flaps and the leading-edge wing slats. The high-lift motors can control the Boeing 787 aircraft at full power even if half of each motor's stator windings are damaged. The motor is a permanent-magnet brushless DC motor controlled by an advanced algorithm, which normally requires a fast, specialized digital signal processor. In this thesis I have tested whether the algorithm can be moved to an FPGA and still meet the timing and safety requirements. This was done by porting an already working algorithm from the digital signal processor to an FPGA. The idea was to run the algorithm on an embedded NIOS II microprocessor and speed up the bottlenecks with Altera's C-to-hardware tool. The study shows that the C code needs to be optimized for C-to-hardware acceleration to deliver any speed-up; in the tests, the calculation time for the algorithm actually became longer with the C-to-hardware tool. The thesis also shows that it is highly feasible to use an FPGA with Altera's NIOS II microprocessor in a safety-critical role instead of a digital signal processor to control the electrical high-lift motors in the Boeing 787 aircraft.


DSP Platform Benchmarking

Luo Xinyuan

This thesis benchmarks DSP kernel algorithms on a DSP processor used for teaching in the course TESA26 at the Department of Electrical Engineering. The benchmarking covers cycle count and memory usage. The goal of the thesis is to evaluate the quality of a single-MAC DSP instruction set and, based on the results, to suggest improvements to the instruction set architecture. The scope is limited to benchmarking the processor with hand-written assembly code; evaluating the quality of the compiler is not included. The benchmarking method follows the one proposed by Berkeley Design Technology, Inc. (BDTI), which is widely used across the DSP industry. The proposed instruction-set improvements include enhancements for the FFT and DCT. The cycle cost of the new FFT benchmark based on the proposal was XX% lower, showing that the proposal was sound. Results also show that the proposal improves the cycle-cost score for matrix computations, especially matrix multiplication. The benchmark results were compared with the published scores of single-MAC DSP processors from BDTI.
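As an illustration of the kind of kernel a single-MAC cycle-count benchmark exercises, here is a minimal C sketch of a block FIR filter (a hypothetical reference kernel, not taken from the thesis); on a single-MAC DSP the inner loop should ideally sustain one multiply-accumulate per cycle, which is what the cycle count measures.

    #include <stdint.h>

    /* Block FIR filter: y[i] = sum_{k=0}^{NTAPS-1} h[k] * x[i-k].
     * x points to the first sample of the block and must be preceded by
     * NTAPS-1 history samples. On a single-MAC DSP the inner loop should
     * retire roughly one multiply-accumulate per cycle, so the benchmark
     * score is about NTAPS cycles per output sample plus loop overhead. */
    #define NTAPS 16

    void fir_q15(const int16_t *x, const int16_t *h, int16_t *y, int n)
    {
        for (int i = 0; i < n; i++) {
            int32_t acc = 0;                      /* 32-bit accumulator     */
            for (int k = 0; k < NTAPS; k++)
                acc += (int32_t)h[k] * x[i - k];  /* one MAC per iteration  */
            y[i] = (int16_t)(acc >> 15);          /* scale Q15 back to 16 bits */
        }
    }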


Efficient arithmetic for high speed DSP implementation on FPGAs

Steven W. Alexander

The author was sponsored by EnTegra Ltd, a company that develops hardware and software products and services for the real-time implementation of DSP and RF systems. The field-programmable gate array (FPGA) is used increasingly in DSP because the parallel computing power of such devices is well suited to today's demanding DSP algorithms. Algorithms such as the QR-RLS update are computationally intensive and must run at extremely high rates (MHz), which rules out a conventional DSP processor. ASICs can be used, but the expense of developing custom logic is prohibitive. The growing use of FPGAs in DSP therefore creates a significant need for efficient arithmetic cores that make good use of the resources on these devices. This thesis presents the research and development effort carried out to produce fixed-point division and square-root cores for use in a new Electronic Design Automation (EDA) tool for EnTegra, targeted at FPGA implementation of DSP systems. In addition, a new technique for predicting the accuracy of CORDIC systems computing vector magnitudes and cosines/sines is presented. This work allows the most efficient CORDIC design for a specified level of accuracy to be found quickly and easily, without the lengthy simulations previously required. The CORDIC algorithm computes many arithmetic functions using mainly shifts and additions and is thus ideal for FPGA implementation.
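As background on the algorithm the abstract refers to, the following is a minimal fixed-point C sketch of CORDIC in vectoring mode, which drives the y component towards zero and thereby computes the vector magnitude scaled by the CORDIC gain. It is the generic textbook form, not EnTegra's core; the iteration count and word length are illustrative choices.

    #include <stdint.h>

    /* CORDIC in vectoring mode: rotates (x, y) towards the x-axis using only
     * shifts and adds. After NITER iterations, x holds K * sqrt(x0^2 + y0^2),
     * where K ~= 1.647 is the CORDIC gain. x0 is assumed non-negative
     * (pre-rotate into the right half-plane otherwise), and the inputs must
     * leave headroom for the gain to avoid overflow. */
    #define NITER 16

    int32_t cordic_magnitude(int32_t x, int32_t y)
    {
        for (int i = 0; i < NITER; i++) {
            int32_t xs = x >> i, ys = y >> i;
            if (y > 0) { x += ys; y -= xs; }   /* rotate clockwise         */
            else       { x -= ys; y += xs; }   /* rotate counter-clockwise */
        }
        return x;   /* magnitude scaled by the CORDIC gain K */
    }

The number of iterations and the word length together determine the achievable accuracy, which is the kind of trade-off the prediction technique in the thesis addresses.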


Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor

Andréas Karlsson

The Division of Computer Engineering at Linköping University is researching the possibility of creating a highly parallel DSP platform that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture, called ePUMA, combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor acts as the main processor for general tasks and execution control, while the co-processors accelerate computation-intensive and parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix-algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse that run on a single co-processor. The kernels are then evaluated to find possible problems with the co-processors' microarchitecture and to suggest solutions. The evaluation shows that the performance potential is very good, but a few problems that cause significant overhead in the kernels have been identified. Pipeline mismatches, which occur because different instructions have different pipeline lengths, cause pipeline hazards, and the current solution to this does not allow effective use of the pipeline. In some cases the single-port memories cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slowly.
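For reference, here is a minimal C sketch of one of the kernel types mentioned (LU decomposition in Doolittle form, without pivoting); the actual kernels in the thesis are written for the fixed-point ePUMA SIMD co-processor rather than in floating-point C, so this is only a structural illustration.

    /* In-place LU decomposition (Doolittle, no pivoting) of an n x n matrix
     * stored row-major in a[]: afterwards the strict lower triangle holds L
     * (unit diagonal implied) and the upper triangle holds U. The matrix
     * determinant is then the product of the diagonal of U. Assumes all
     * pivots are non-zero. */
    void lu_decompose(double *a, int n)
    {
        for (int k = 0; k < n; k++) {
            for (int i = k + 1; i < n; i++) {
                double m = a[i * n + k] / a[k * n + k];  /* multiplier l_ik */
                a[i * n + k] = m;
                for (int j = k + 1; j < n; j++)
                    a[i * n + j] -= m * a[k * n + j];    /* eliminate row i */
            }
        }
    }

Kernels of this kind contain many short dependency chains (each multiplier feeds the row update that follows), which is exactly where the pipeline mismatches and missing register forwarding mentioned above cost cycles.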


Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor

Orri Tomasson

This thesis is about implementing the reciprocal, square root, inverse square root and logarithm functions on a DSP platform. A multi-core DSP platform consisting of one master processor core and several SIMD coprocessor cores is currently being designed by a team at the Computer Engineering Department of Linköping University. The SIMD coprocessors' arithmetic logic unit (ALU) has 16 multipliers to support vector multiplication instructions; by using the 16 multipliers efficiently, polynomials can be evaluated very fast. The ALU has no hardware support for floating-point arithmetic, so the challenge is to obtain good precision using fixed-point arithmetic. Precise and fast implementations of the mathematical functions are obtained by converting the fixed-point input to a soft floating-point format before polynomial approximation, choosing the polynomial based on an error analysis of the approximation, and using Newton-Raphson or Goldschmidt iterations to improve the precision of the polynomial approximations. Finally, changes and additions to the instruction set architecture are suggested in order to make the implementations faster by using the existing hardware more efficiently.
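To illustrate the general approach (a crude seed refined by Newton-Raphson iterations), here is a minimal C sketch of a fixed-point reciprocal. The Q2.30 format, the linear seed and the iteration count are illustrative choices, not the formats or seeds used on the coprocessor.

    #include <stdint.h>

    /* Fixed-point reciprocal via Newton-Raphson, generic illustration.
     * Input  a: Q2.30, normalized so 1.0 <= a < 2.0 (1<<30 <= a < 1<<31).
     * Output  : approximation of 1/a in Q2.30.
     * Iteration y <- y*(2 - a*y) roughly doubles the correct bits each step,
     * so 4 iterations from a crude linear seed reach about 30 bits. */
    #define FRAC 30

    uint32_t recip_q30(uint32_t a)
    {
        /* Linear seed y0 = 1.5 - a/2, exact at both ends of [1, 2). */
        uint64_t y = ((uint64_t)3 << (FRAC - 1)) - (a >> 1);

        for (int i = 0; i < 4; i++) {
            uint64_t ay  = (y * a) >> FRAC;       /* a*y in Q2.30      */
            uint64_t two = (uint64_t)2 << FRAC;   /* 2.0 in Q2.30      */
            y = (y * (two - ay)) >> FRAC;         /* y*(2 - a*y)       */
        }
        return (uint32_t)y;
    }

The soft floating-point conversion mentioned in the abstract amounts to normalizing the input into a range like [1, 2) first, so that a fixed polynomial or iteration such as the one above applies to any input magnitude.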


Benchmarking a DSP processor

Per Lennartsson, Lars Nordlander

This Master's thesis describes the benchmarking of a DSP processor. Benchmarking means measuring performance in some way; in this report, we focus on the number of instruction cycles needed to execute certain algorithms, all of which are very common in signal processing today. The results reached in this thesis are compared to benchmarks for other processors performed by Berkeley Design Technology, Inc. (BDTI). The algorithms were programmed in assembly code and executed on the instruction set simulator. After that, we proposed changes to the instruction set with the aim of reducing the execution time of the algorithms. The benchmark results show that our processor is at the same level as the ones tested by BDTI. A more experienced programmer would probably be able to reduce the cycle counts even further, especially for some of the more complex benchmarks.


Correlation and Power Spectrum

Duraisamy Sundararajan

In the signals and systems course and in the first course in digital signal processing, a signal is most often characterized by its amplitude profile in the time domain and its amplitude spectrum in the frequency domain. Students become so used to this type of characterization that, in the ensuing statistical signal processing course, they find it difficult to appreciate that a signal can instead be characterized by its autocorrelation function in the time domain and the corresponding power spectrum in the frequency domain, with no amplitude characterization available. This article describes the characterization of a signal by its autocorrelation function in the time domain and the corresponding power spectrum in the frequency domain. Cross-correlation of two signals is also presented.
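For a wide-sense stationary discrete-time signal x[n], the two characterizations the article contrasts are related by the standard textbook definitions below (not reproduced from the article itself):

    r_{xx}[k] = E\{\, x[n]\, x[n+k] \,\}, \qquad
    S_{xx}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\, e^{-j\omega k},
    \qquad
    r_{xy}[k] = E\{\, x[n]\, y[n+k] \,\}.

The power spectrum is the discrete-time Fourier transform of the autocorrelation (the Wiener-Khinchin relation), so it retains power-versus-frequency information while discarding the phase that an amplitude spectrum would carry.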


Digital Signal Processing Maths

Markus Hoffmann

Modern digital signal processing makes use of a variety of mathematical techniques. These techniques are used to design and understand efficient filters for data processing and control.


DSP Memory Management in a Third Generation High Performance Base Station

Lasse Haverinen

Most of the tasks in a mobile cellular network base station are performed with programmable digital signal processors, whose memory spaces and memory-management features are very limited. The buffering requirements in the base station can vary greatly from one instant to the next during the simultaneous transmission of bursty data on multiple channels to multiple users. In particular, the high bit rates of High Speed Downlink Packet Access (HSDPA), the data-transfer evolution of Wideband Code Division Multiple Access, place very high demands on buffering. Fragmentation of the buffer memory is a threat: it causes a gradual decrease in performance, which is critical in a long-running process like the base station, and the amount of fragmentation differs between memory-management methods. In this work the features and applicability of different memory-management methods for the signal processors used in third-generation cellular network base stations have been studied. Software-based memory management involves a large number of conditional branches, and a signal processor, which is optimized for highly parallel sequential computing, executes conditional branches poorly compared to microcontrollers and general-purpose processors. The memory-management methods are first studied in theory and then experimentally. In the experiments two different memory-management methods were analyzed. The memory managers were loaded with a synthetic workload program that simulates multi-user, high bit-rate data transmissions in the base station, and their performance was measured in terms of fragmentation, execution time and memory utilization. The experiments confirmed the finding from the theoretical studies that different memory-management methods are usually optimized for a particular property. They showed that a simple method is fast to execute and works well with small and intermediate loads, but its performance decreases as the load increases. The second, more complex method was found to require more computation, but to be capable of using the memory space assigned to it more effectively.
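As a concrete illustration of what a "simple method" can look like, here is a minimal C sketch of a fixed-size block pool allocator (a hypothetical example, not necessarily either of the two methods measured in the thesis). It is fast and nearly branch-free and cannot fragment externally, but it wastes memory whenever a request is much smaller than the block size.

    #include <stddef.h>

    /* Fixed-size block pool: every allocation returns one BLOCK_SIZE block
     * taken from a singly linked free list. Allocation and release are O(1)
     * with a single conditional branch, and external fragmentation cannot
     * occur, but any request smaller than BLOCK_SIZE wastes the remainder. */
    #define BLOCK_SIZE 512
    #define NUM_BLOCKS 256

    typedef union block {
        union block  *next;             /* link while the block is free   */
        unsigned char data[BLOCK_SIZE]; /* payload while it is allocated  */
    } block_t;

    static block_t  pool[NUM_BLOCKS];
    static block_t *free_list;

    void pool_init(void)
    {
        for (int i = 0; i < NUM_BLOCKS - 1; i++)
            pool[i].next = &pool[i + 1];   /* chain all blocks together */
        pool[NUM_BLOCKS - 1].next = NULL;
        free_list = &pool[0];
    }

    void *pool_alloc(void)
    {
        block_t *b = free_list;
        if (b != NULL)
            free_list = b->next;           /* pop the first free block  */
        return b;
    }

    void pool_free(void *p)
    {
        block_t *b = (block_t *)p;
        b->next = free_list;               /* push back onto free list  */
        free_list = b;
    }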


EngD thesis: Reduced-Complexity Signal Detection in Digital Communications Receivers

Julian Penketh

The author began this Engineering Doctorate (EngD) in autumn 1999, while already in full-time employment as a DSP software engineer at Nortel Networks, Harlow. The EngD comprises a set of three projects. The first project focused on the development of dual-tone multi-frequency (DTMF) signal detection software. DTMF signals are currently used for operating menu-driven services such as voice-mail, telephone banking and share-dealing. Detection software is needed in a packet-networking environment because such signals become degraded when they travel through speech-compression circuits and, if they appear as echoes on a telephone line, they can disturb the operation of echo-cancellation systems. The thesis discusses a number of DSP algorithms for which fast detection and minimum complexity are key characteristics. A key contribution here was the development of a novel detection algorithm based on the extraction of parameters that model the DTMF signal. The thesis reports a method combining parameter extraction with maximum-likelihood estimation to perform DTMF detection, resulting in much shorter detection time-frames than standard approaches. Reducing the complexity of detection techniques is also important in today's communication systems; to this end, another key contribution was the development of the 'split Goertzel algorithm', which permits overlapping analysis windows without reprocessing the input data. Besides voice-band signals such as DTMF, the author also had the opportunity during the EngD to apply the signal-processing skills and knowledge acquired to higher-rate data streams, in work concerning the equalisation of the channel distortion commonly found in wireless communication systems. This covers two projects, the first conducted at Verticalband Ltd. (no longer operational) in the area of digital television receivers. In this part of the thesis a real-time DSP implementation for enhancing a wireless-channel simulation system is discussed, and a number of channel-equalisation approaches are studied. The work concludes that for high-rate signals, non-linear algorithms have the best error-rate performance. On the basis of these studies, the thesis considers development and implementation issues of designs based on the decision-feedback equaliser, and reports on a software design that applies the method of least squares to carry out filter-coefficient adaptation. The Verticalband studies led on to further research in channel equalisation, this time on the detection of data in multiple-input multiple-output (MIMO) wireless local area network (WLAN) systems, undertaken at Philips Research in Eindhoven, The Netherlands. The thesis discusses implementation scenarios of multi-element antenna arrays that aim to provide bit-rates orders of magnitude higher than today's WLAN offerings, such as those required by emerging standards like 802.11n. The complexity of optimal detection techniques, such as maximum likelihood, scales exponentially with the number of transmit antennas. This translates to high processing costs and power consumption, rendering such techniques unsuitable for use in battery-powered devices.
The initial main contribution here was an analysis of the complexity of algorithms whose performance had already been tested, such as the non-linear maximum-likelihood approach and less complex methods based on linear filtering. This resulted in new formulae for predicting the processing cost of the algorithms as a function of the number of transmit and receive antennas. Other key contributions were a method for reducing the complexity of matrix inversion when using the Moore-Penrose pseudo-inverse, and the application of matrix decomposition to avoid costly matrix inversion altogether. Some ongoing research into sub-optimal detection is also discussed, describing methods for reducing the search space of the maximum-likelihood algorithm.
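For reference, the standard (non-split) Goertzel algorithm that the 'split Goertzel' contribution builds on computes a single DFT bin with one second-order recursion per tone. A minimal C sketch of the textbook form follows; it is not the author's variant.

    #include <math.h>

    /* Standard Goertzel algorithm: returns the squared magnitude of DFT
     * bin k for an N-sample block x[]. For DTMF detection one such
     * recursion is run per tone frequency, which is far cheaper than a
     * full FFT when only a handful of bins are needed. */
    double goertzel_power(const float *x, int n, int k)
    {
        double w     = 2.0 * M_PI * (double)k / (double)n;
        double coeff = 2.0 * cos(w);
        double s0 = 0.0, s1 = 0.0, s2 = 0.0;

        for (int i = 0; i < n; i++) {
            s0 = x[i] + coeff * s1 - s2;   /* second-order recursion */
            s2 = s1;
            s1 = s0;
        }
        /* |X[k]|^2 from the final two state values */
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }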