Evaluation of Image Warping Algorithms for Implementation in FPGA

Anton Serguienko

The target of this master thesis is to evaluate the Image Warping technique and propose a possible design for an implementation in FPGA. The Image Warping is widely used in the image processing for image correction and rectification. A DSP is a usual choice for implantation of the image processing algorithms, but to decrease a cost of the target system it was proposed to use an FPGA for implementation. In this work a different Image Warping methods was evaluated in terms of performance, produced image quality, complexity and design size. Also, considering that it is not only Image Warping algorithm which will be implemented on the target system, it was important to estimate a possible memory bandwidth used by the proposed design. The evaluation was done by implemented a C-model of the proposed design with a finite datapath to simulate hardware implementation as close as possible.

Implementation of Algorithms on FPGAs

Mattias Karlsson

This thesis describes how an algorithm is transferred from a digital signal processor to an embedded microprocessor in an FPGA using C to hardware program from Altera. Saab Avitronics develops the secondary high lift control system for the Boeing 787 aircraft. The high lift system consists of electric motors controlling the trailing edge wing flaps and the leading edge wing slats. The high lift motors manage to control the Boeing 787 aircraft with full power even if half of each motor’s stators are damaged. The motor is a PMDC brushless motor which is controlled by an advanced algorithm. The algorithm needs to be calculated by a fast special digital signal processor. In this thesis I have tested if the algorithm can be transferred to an FPGA and still manage the time and safety demands. This was done by transferring an already working algorithm from the digital signal processor to an FPGA. The idea was to put the algorithm in an embedded NIOS II microprocessor and speed up the bottlenecks with Altera’s C to hardware program. The study shows that the C-code needs to be optimized for C to hardware to manage the up speeding part, as the tests showed that the calculation time for the algorithm actually became longer with C to hardware. This thesis also shows that it is highly probable to use an FPGA equipped with Altera’s NIOS II safety critical microprocessor instead of a digital signal processor to control the electrical high lift motors in the Boeing 787 aircraft.

DSP Platform Benchmarking

Luo Xinyuan
1 comment

Benchmarking of DSP kernel algorithms was conducted in the thesis on a DSP processor for teaching in the course TESA26 in the department of Electrical Engineering. It includes benchmarking on cycle count and memory usage. The goal of the thesis is to evaluate the quality of a single MAC DSP instruction set and provide suggestions for further improvement in instruction set architecture accordingly. The scope of the thesis is limited to benchmark the processor only based on assembly coding. The quality check of compiler is not included. The method of the benchmarking was proposed by BDTI, Berkeley Design Technology Incorporations, which is the general methodology used in world wide DSP industry. Proposals on assembly instruction set improvements include the enhancement of FFT and DCT. The cycle cost of the new FFT benchmark based on the proposal was XX% lower, showing that the proposal was right and qualified. Results also show that the proposal promotes the cycle cost score for matrix computing, especially matrix multiplication. The benchmark results were compared with general scores of single MAC DSP processors offered by BDTI.

Efficient arithmetic for high speed DSP implementation on FPGAs

Steven W. Alexander

The author was sponsored by EnTegra Ltd, a company who develop hardware and software products and services for the real time implementation of DSP and RF systems. The field programmable gate array (FPGA) is being used increasingly in the field of DSP. This is due to the fact that the parallel computing power of such devices is ideal for today’s truly demanding DSP algorithms. Algorithms such as the QR-RLS update are computationally intensive and must be carried out at extremely high speeds (MHz). This means that the DSP processor is simply not an option. ASICs can be used but the expense of developing custom logic is prohibitive. The increased use of the FPGA in DSP means that there is a significant requirement for efficient arithmetic cores that utilises the resources on such devices. This thesis presents the research and development effort that was carried out to produce fixed point division and square root cores for use in a new Electronic Design Automation (EDA) tool for EnTegra, which is targeted at FPGA implementation of DSP systems. Further to this, a new technique for predicting the accuracy of CORDIC systems computing vector magnitudes and cosines/sines is presented. This work allows the most efficient CORDIC design for a specified level of accuracy to be found quickly and easily without the need to run lengthy simulations, as was the case before. The CORDIC algorithm is a technique using mainly shifts and additions to compute many arithmetic functions and is thus ideal for FPGA implementation.

Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor

Andréas Karlsson

The Division of Computer Engineering at Linköping's university is currently researching the possibility to create a highly parallel DSP platform, that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing intensive and parallel DSP kernels.This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse, that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors' microarchitecture and suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified, that causes significant overhead in the kernels. Pipeline mismatches, that occurs due to different pipeline lengths for different instructions, causes pipeline hazards and the current solution to this, doesn't allow effective use of the pipeline. In some cases, the single port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slow.

Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor

Orri Tomasson

This thesis is about implementing the functions for reciprocal, square root, inverse square root and logarithms on a DSP platform. A multi-core DSP platform that consists of one master processor core and several SIMD coprocessor cores is currently being designed by a team at the Computer Engineering Department of Linköping University. The SIMD coprocessors’ arithmetic logic unit (ALU) has 16 multipliers to support vector multiplication instructions. By efficiently using the 16 multipliers, it is possible to evaluate polynomials very fast. The ALU does not have (hardware) support for floating point arithmetic, so the challenge is to get good precision by using fixed point arithmetic. Precise and fast solutions to implement the mathematical functions are found by converting the fixed point input to a soft floating point format before polynomial approximation, choosing a polynomial based on an error analysis of the polynomial approximation, and using Newton-Raphson or Goldschmidt iterations to improve the precision of the polynomial approximations. Finally, suggestions are made of changes and additions to the instruction set architecture, in order to make the implementations faster, by efficiently using the currently existing hardware.

Benchmarking a DSP processor

Per Lennartsson, Lars Nordlander

This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. The algorithms were programmed in assembly code and then executed on the instruction set simulator. After that, we proposed changes to the instruction set, with the aim to reduce the execution time for the algorithms. The results from the benchmark show that our processor is at the same level as the ones tested by BDTI. Probably would a more experienced programmer be able to reduce the cycle count even more, especially for some of the more complex benchmarks.

Correlation and Power Spectrum

Duraisamy Sundararajan

In the signals and systems course and in the first course in digital signal processing, a signal is, most often, characterized by its amplitude spectrum in the frequency-domain and its amplitude profile in the time-domain. So much a student gets used to this type of characterization, that the student finds it difficult to appreciate, when encountered in the ensuing statistical signal processing course, the fact that a signal can also be characterized by its autocorrelation function in the time-domain and the corresponding power spectrum in the frequency-domain and that the amplitude characterization is not available. In this article, the characterization of a signal by its autocorrelation function in the time-domain and the corresponding power spectrum in the frequency-domain is described. Cross-correlation of two signals is also presented.

Digital Signal Processing Maths

Markus Hoffmann
1 comment

Modern digital signal processing makes use of a variety of mathematical techniques. These techniques are used to design and understand efficient filters for data processing and control.

DSP Memory Management in a Third Generation High Performance Base Station

Lasse Haverinen

Most of the tasks in a mobile cellular network base station are performed with programmable digital signal processors. Their memory spaces and management features are very limited. The buffering requirements in the base station can have large instantaneous variations during the simultaneous transmission of burst' data on multiple channels to multiple users. In particular the high bit-rates of the Wideband Code Division Multiple Access data transfer evolution High Speed Downlink Packet Access create very high demands for buffering. The fragmentation of the buffer memory is a threat. It causes a gradual decrease in performance, which is critical in a long running process like the base station. The amount of fragmentation is different with different memory management methods. In this work the features and applicability of different memory management methods for signal processors used in the base stations of third generation cellular networks have been studied. Software based memory management includes a high amount of conditional branches. The signal processor, which is optimized for highly parallel sequential computing, executes conditional branches very badly when compared to microcontrollers and general-purpose processors. The memory management methods are first studied in theory and then experimentally. In the experiments two different memory management methods were analyzed. The memory managers were loaded with a synthetic workload program that simulates multi-user high bit-rate data transmissions in the base station. The performances of the memory managers were measured in terms of fragmentation, execution time and memory utilization. The experiments confirmed the information gained from the theoretical studies that different memory management methods are usually optimized for a certain feature. The experiments showed that a simple method is fast to execute and works well with small and intermediate loads. When the load is increased the performance decreases. The second, more complex, measured method was found to require more computing, but to be capable of using the memory space assigned to it more effectively.