Development of a real time test platform for motor drive algorithms
In this thesis a real time test platform for a permanent magnet synchronous motor is developed. The implemented algorithm is Field Oriented Control (FOC) and it is implemented on a Texas Instruments TMS320F2808 Digital Signal Processor (DSP). The platform is developed in a rapid prototyping approach using Matlab/Simulink and the Real Time Workshop (RTW) packages.With this software the control algorithm and its interface to different DSP modules, such as A/D converter and PWM module, is constructed as a Simulink block scheme. The blocks used come from ordinary Simulink libraries and libraries provided by the RTW packages. From the Simulink block scheme Matlab can auto generate embedded C code adapted for different embedded targets, in this case the 2808 DSP.The developed real time test platform is also a Simulink model, though different from the algorithm model. When the start simulation command is given in the platform model a Graphical User Interface is loaded which lets the user specify motor parameters and certain algorithm parameters. Once the parameters are chosen RTW generates code from the algorithm model, loads it into the DSP and runs the generated program. From the platform model it is possible to set the reference speed of the motor in real time and monitor/log motor parameters such as actual speed and stator currents.
Optimization of Audio Processing algorithms (Reverb) on ARMv6 family of processors
Audio processing algorithms are increasingly used in cell phones and today’s customers are placing more demands on cell phones. Feature phones, once the advent of mobile phone technology, nowadays do more than just providing the user with MP3 play back or advanced audio effects. These features have become an integral part of medium as well as low-end phones. On the other hand, there is also an endeavor to include as improved quality as possible into products to compete in market and satisfy users’ needs. Tackling the above requirements has been partly satisfied by the advance in hardware design and manufacturing technology. However, as new hardware emerges into market the need for competence to write efficient software and exploit the new features thoroughly and effectively arises. Even though compilers are also keeping up with the new tide space for hand optimized code still exist. Wrapped in the above goal, an effort was made in this thesis to partly cover the competence requirement at Multimedia Section (part of Ericsson Mobile Platforms) to develope optimized code for new processors. Forging persistently ahead with new products, EMP has always incorporated the latest technology into its products among which ARMv6 family of processors has the main central processing role in a number of upcoming products. To fully exploit latest features provided by ARMv6, it was required to probe its new instruction set among which new media processing instructions are of outmost importance. In order to execute DSP-intensive algorithms (e.g. Audio Processing algorithms) efficiently, the implementation should be done in low-level code applying available instruction set. Meanwhile, ARMv6 comes with a number of new features in comparison with its predecessors. SIMD (Single Instruction Multiple Data) and VFP (Vector Floating Point) are the most prominent media processing improvements in ARMv6. Aligned with thesis goals and guidelines, Reverb algorithm which is among one of the most complicated audio features on a hand-held devices was probed. Consequently, its kernel parts were identified and implementation was done both in fixed-point and floating-point using the available resources on hardware. Besides execution time and amount of code memory for each part were measured and provided in tables and charts for comparison purposes. Conclusions were finally drawn based on developed code’s efficiency over ARM compiler’s as well as existing code already developed and tailored to ARMv5 processors. The main criteria for optimization was the execution time. Moreover, quantization effect due to limited precision fixed-point arithmetic was formulated and its effect on quality was elaborated. The outcomes, clearly indicate that hand optimization of kernel parts are superior to Compiler optimized alternative both from the point of code memory as well as execution time. The results also confirmed the presumption that hand optimized code using new instruction set can improve efficiency by an average 25%-50% depending on the algorithm structure and its interaction with other parts of audio effect. Despite its many draw backs, fixed-point implementation remains yet to be the dominant implementation for majority of DSP algorithms on low-power devices.
Automatic Parallel Memory Address Generation for Parallel DSP Computing
The concept of Parallel Vector (scratch pad) Memories (PVM) was introduced as one solution for Parallel Computing in DSP, which can provides parallel memory addressing efficiently with minimum latency. The parallel programming more efficient by using the parallel addressing generator for parallel vector memory (PVM) proposed in this thesis. However, without hiding complexities by cache, the cost of programming is high. To minimize the programming cost, automatic parallel memory address generation is needed to hide the complexities of memory access. This thesis investigates methods for implementing conflict-free vector addressing algorithms on a parallel hardware structure. In particular, match vector addressing requirements extracted from the behaviour model to a prepared parallel memory addressing template, in order to supply data in parallel from the main memory to the on-chip vector memory. According to the template and usage of the main and on-chip parallel vector memory, models for data pre-allocation and permutation in scratch pad memories of ASIP can be decided and configured. By exposing the parallel memory access of source code, the memory access flow graph (MFG) will be generated. Then MFG will be used combined with hardware information to match templates in the template library. When it is matched with one template, suited permutation equation will be gained, and the permutation table that include target addresses for data pre-allocation and permutation is created. Thus it is possible to automatically generate memory address for parallel memory accesses. A tool for achieving the goal mentioned above is created, Permutator, which is implemented in C++ combined with XML. Memory access coding template is selected, as a result that permutation formulas are specified. And then PVM address table could be generated to make the data pre-allocation, so that efficient parallel memory access is possible. The result shows that the memory access complexities is hiden by using Permutator, so that the programming cost is reduced.It works well in the context that each algorithm with its related hardware information is corresponding to a template case, so that extra memory cost is eliminated.
Voice Codec for Floating Point Processor
As part of an ongoing project at the department of electrical engineering, ISY, at Linköping University, a voice decoder using floating point formats has been the focus of this master thesis. Previous work has been done developing an mp3-decoder using the floating point formats. All is expected to be implemented on a single DSP.The ever present desire to make things smaller, more efficient and less power consuming are the main reasons for this master thesis regarding the use of a floating point format instead of the traditional integer format in a GSM codec. The idea with the low precision floating point format is to be able to reduce the size of the memory. This in turn reduces the size of the total chip area needed and also decreases the power consumption.One main question is if this can be done with the floating point format without losing too much sound quality of the speech. When using the integer format, one can represent every value in the range depending on how many bits are being used. When using a floating point format you can represent larger values using fewer bits compared to the integer format but you lose representation of some values and have to round the values off.From the tests that have been made with the decoder during this thesis, it has been found that the audible difference between the two formats is very small and can hardly be heard, if at all. The rounding seems to have very little effect on the quality of the sound and the implementation of the codec has succeeded in reproducing similar sound quality to the GSM standard decoder.
Evaluation of Image Warping Algorithms for Implementation in FPGA
The target of this master thesis is to evaluate the Image Warping technique and propose a possible design for an implementation in FPGA. The Image Warping is widely used in the image processing for image correction and rectification. A DSP is a usual choice for implantation of the image processing algorithms, but to decrease a cost of the target system it was proposed to use an FPGA for implementation. In this work a different Image Warping methods was evaluated in terms of performance, produced image quality, complexity and design size. Also, considering that it is not only Image Warping algorithm which will be implemented on the target system, it was important to estimate a possible memory bandwidth used by the proposed design. The evaluation was done by implemented a C-model of the proposed design with a finite datapath to simulate hardware implementation as close as possible.
Implementation of Algorithms on FPGAs
This thesis describes how an algorithm is transferred from a digital signal processor to an embedded microprocessor in an FPGA using C to hardware program from Altera. Saab Avitronics develops the secondary high lift control system for the Boeing 787 aircraft. The high lift system consists of electric motors controlling the trailing edge wing flaps and the leading edge wing slats. The high lift motors manage to control the Boeing 787 aircraft with full power even if half of each motor’s stators are damaged. The motor is a PMDC brushless motor which is controlled by an advanced algorithm. The algorithm needs to be calculated by a fast special digital signal processor. In this thesis I have tested if the algorithm can be transferred to an FPGA and still manage the time and safety demands. This was done by transferring an already working algorithm from the digital signal processor to an FPGA. The idea was to put the algorithm in an embedded NIOS II microprocessor and speed up the bottlenecks with Altera’s C to hardware program. The study shows that the C-code needs to be optimized for C to hardware to manage the up speeding part, as the tests showed that the calculation time for the algorithm actually became longer with C to hardware. This thesis also shows that it is highly probable to use an FPGA equipped with Altera’s NIOS II safety critical microprocessor instead of a digital signal processor to control the electrical high lift motors in the Boeing 787 aircraft.
DSP Platform Benchmarking
Benchmarking of DSP kernel algorithms was conducted in the thesis on a DSP processor for teaching in the course TESA26 in the department of Electrical Engineering. It includes benchmarking on cycle count and memory usage. The goal of the thesis is to evaluate the quality of a single MAC DSP instruction set and provide suggestions for further improvement in instruction set architecture accordingly. The scope of the thesis is limited to benchmark the processor only based on assembly coding. The quality check of compiler is not included. The method of the benchmarking was proposed by BDTI, Berkeley Design Technology Incorporations, which is the general methodology used in world wide DSP industry. Proposals on assembly instruction set improvements include the enhancement of FFT and DCT. The cycle cost of the new FFT benchmark based on the proposal was XX% lower, showing that the proposal was right and qualified. Results also show that the proposal promotes the cycle cost score for matrix computing, especially matrix multiplication. The benchmark results were compared with general scores of single MAC DSP processors offered by BDTI.
Efficient arithmetic for high speed DSP implementation on FPGAs
The author was sponsored by EnTegra Ltd, a company who develop hardware and software products and services for the real time implementation of DSP and RF systems. The field programmable gate array (FPGA) is being used increasingly in the field of DSP. This is due to the fact that the parallel computing power of such devices is ideal for today’s truly demanding DSP algorithms. Algorithms such as the QR-RLS update are computationally intensive and must be carried out at extremely high speeds (MHz). This means that the DSP processor is simply not an option. ASICs can be used but the expense of developing custom logic is prohibitive. The increased use of the FPGA in DSP means that there is a significant requirement for efficient arithmetic cores that utilises the resources on such devices. This thesis presents the research and development effort that was carried out to produce fixed point division and square root cores for use in a new Electronic Design Automation (EDA) tool for EnTegra, which is targeted at FPGA implementation of DSP systems. Further to this, a new technique for predicting the accuracy of CORDIC systems computing vector magnitudes and cosines/sines is presented. This work allows the most efficient CORDIC design for a specified level of accuracy to be found quickly and easily without the need to run lengthy simulations, as was the case before. The CORDIC algorithm is a technique using mainly shifts and additions to compute many arithmetic functions and is thus ideal for FPGA implementation.
Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor
The Division of Computer Engineering at Linköping's university is currently researching the possibility to create a highly parallel DSP platform, that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing intensive and parallel DSP kernels.This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse, that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors' microarchitecture and suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified, that causes significant overhead in the kernels. Pipeline mismatches, that occurs due to different pipeline lengths for different instructions, causes pipeline hazards and the current solution to this, doesn't allow effective use of the pipeline. In some cases, the single port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slow.
Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor
This thesis is about implementing the functions for reciprocal, square root, inverse square root and logarithms on a DSP platform. A multi-core DSP platform that consists of one master processor core and several SIMD coprocessor cores is currently being designed by a team at the Computer Engineering Department of Linköping University. The SIMD coprocessors’ arithmetic logic unit (ALU) has 16 multipliers to support vector multiplication instructions. By efficiently using the 16 multipliers, it is possible to evaluate polynomials very fast. The ALU does not have (hardware) support for floating point arithmetic, so the challenge is to get good precision by using fixed point arithmetic. Precise and fast solutions to implement the mathematical functions are found by converting the fixed point input to a soft floating point format before polynomial approximation, choosing a polynomial based on an error analysis of the polynomial approximation, and using Newton-Raphson or Goldschmidt iterations to improve the precision of the polynomial approximations. Finally, suggestions are made of changes and additions to the instruction set architecture, in order to make the implementations faster, by efficiently using the currently existing hardware.
EngD thesis: Reduced-Complexity Signal Detection in Digital Communications Receivers
The Author began this Engineering Doctorate (EngD) in Autumn 1999, whilst already in full-time employment as a DSP software engineer at Nortel Networks, Harlow. This EngD comprises a set of three projects. The first project was focused on the development of dual-tone multi-frequency (DTMF) signal detection software. DTMF signals are currently used for operating menu-driven services such as voice-mail, telephone banking and share-dealing. The need for detection software in a packet networking environment exists because such signals become degraded when they travel through speech compression circuits. In addition, if they appear as echoes on a telephone line, they can affect the operation of echo cancellation systems. In this thesis a number of DSP algorithms are discussed where fast detection and minimum complexity are key characteristics. A key contribution here was the development of a novel detection algorithm based on the extraction of parameters that model the DTMF signal. The thesis reports a method combining parameter extraction with the technique of maximum likelihood to perform DTMF detection, resulting in very short time-frames when compared to standard approaches. Reducing the complexity of detection techniques is also important in today’s communication systems. To this end a key contribution here was the development of the ‘split Goertzel algorithm’, which permitted an overlapping of analysis windows without the need for reprocessing input data. Besides being applied to voice-band signals, such as in the case of DTMF, the Author also had the opportunity during the EngD to apply the skills and knowledge acquired in signal processing to higher-rate data-streams. This involved work concerning the equalisation of channel distortion commonly found in wireless communication systems. This covers two projects, with the first being conducted at Verticalband Ltd. (now no longer operational) in the area of digital television receivers. In this part of the thesis a real-time DSP implementation is discussed for enhancing a simulation system developed for wireless channels. A number of channel equalisation approaches are studied. The work concludes that for high-rate signals, non-linear algorithms have the best error rate performance. On the basis of the studies carried out, the thesis considers development and implementation issues of designs based on the decision feedback equaliser. The thesis reports on a software design which applies the method of least squares to carry out filter coefficient adaptation. The Verticalband studies reported lead on to further research within the context of channel equalisation, in the context of the detection of data in multiple-input multiple-output (MIMO) wireless local area network (WLAN) systems. This work was undertaken at Philips Research in Eindhoven, The Netherlands. The thesis discusses implementation scenarios of multi-element antenna arrays that aim to provide bit-rates orders of magnitude higher than today’s WLAN offerings, as those required by emerging standards such as 802.11n. The complexity of optimal detection techniques, such as maximum likelihood, scales exponentially with the number of transmit antennas. This translates to high processing costs and power consumption, rendering such techniques unsuitable for use in battery-powered devices. The initial main contribution here was the analysis of the complexity of algorithms whose performance had already been tested, such as the non-linear maximum likelihood approach and also less complex methods utilising linear filtering. This resulted in the development of new formulae to predict processing costs of algorithms based on the number of transmit and receive antennas. Other key contributions were defining a method to reduce the complexity of matrix inversion when using the Moore-Penrose pseudo-inverse, and the application of matrix decomposition to obviate the need for costly matrix inversion at all. Some on-going research into sub-optimal detection is also discussed, which describes methods to reduce the search-space for the maximum likelihood algorithm.
Auditory System for a Mobile Robot
The auditory system of living creatures provides useful information about the world, such as the location and interpretation of sound sources. For humans, it means to be able to focus one's attention on events, such as a phone ringing, a vehicle honking, a person taking, etc. For those who do not suffer from hearing impairments, it is hard to imagine a day without being able to hear, especially in a very dynamic and unpredictable world. Mobile robots would also benefit greatly from having auditory capabilities. In this thesis, we propose an artificial auditory system that gives a robot the ability to locate and track sounds, as well as to separate simultaneous sound sources and recognising simultaneous speech. We demonstrate that it is possible to implement these capabilities using an array of microphones, without trying to imitate the human auditory system. The sound source localisation and tracking algorithm uses a steered beamformer to locate sources, which are then tracked using a multi-source particle filter. Separation of simultaneous sound sources is achieved using a variant of the Geometric Source Separation (GSS) algorithm, combined with a multisource post-filter that further reduces noise, interference and reverberation. Speech recognition is performed on separated sources, either directly or by using Missing Feature Theory (MFT) to estimate the reliability of the speech features. The results obtained show that it is possible to track up to four simultaneous sound sources, even in noisy and reverberant environments. Real-time control of the robot following a sound source is also demonstrated. The sound source separation approach we propose is able to achieve a 13.7 dB improvement in signal-to-noise ratio compared to a single microphone when three speakers are present. In these conditions, the system demonstrates more than 80% accuracy on digit recognition, higher than most human listeners could obtain in our small case study when recognising only one of these sources. All these new capabilities will allow humans to interact more naturally with a mobile robot in real life settings.
Towards a Real-Time Implementation of Loudness Enhancement Algorithms on a Motorola DSP 56600
Most of the cellular phone companies with audio speaker capabilities focus on reducing the current drain to extend battery life. None of these companies concentrate on modifying the speech signal itself to make it sound louder in noisy listener environments without adding additional energy. Such algorithms have been described in literature by Boillot and form the backbone of this thesis. The current project focusses on taking a step towards running these algorithms in real-time on a 16-bit fixed point Motorola DSP 56600. Implementation of the autocorrelation, Levinson- Durbin, FIR, and IIR filters in assembly for the Motorola DSP 56600 has been investigated in the thesis. The challenges and alternate solutions to circumvent the challenges have been described, and experimental results have been presented. Results indicate that the modified signed LMS algorithm, which can be considered to be a blend between the LMS and signed LMS algorithms, turns out to be an elegant solution to circumvent the challenges in implementing the Levinson-Durbin recursion.
A Two-Level Reconfigurable Cell Array for Digital Signal Processing
Reconfigurable hardware has become an attractive option for implementing digital signal processing, especially in systems that require both high performance and flexibility. This thesis presents a novel two-level reconfigurable architecture targeted toward systems with these requirements. The architecture supports a large orthogonal design space whereby designers can customize the word length, amount of parallelism, number of functional units, and functional unit connectivity to meet the needs of the application. On the upper level, algorithms are mapped onto an array of 4-bit cells and a hierarchical interconnection fabric. The interconnection structure contains a mesh of 4-bit busses for local data transfer, as well as an H-tree for communicating results between functional units. On the lower level, each cell contains a small matrix of elements that collectively implement all necessary operations. The matrix of elements has only two configurations: one optimized for mathematical functions such as multiply-accumulates, and the other optimized for memory operations. The system also contains pipeline latches to maximize clock rate and throughput. Circuit simulations indicate that the architecture achieves a clock frequency of 200 MHz in a modest 0.25-μm CMOS technology. An initial prototype of the reconfigurable cell has been fabricated in 0.5-μm CMOS and tested for functionality. The estimated execution time for a 16-bit, 256-point Fast Fourier Transform shows a speedup ranging from 1.6 to 14 compared to contemporary digital signal processors.
Image Analysis Using a Dual-Tree M-Band Wavelet Transform
We propose a 2D generalization to the M-band case of the dual-tree decomposition structure (initially proposed by N. Kingsbury and further investigated by I. Selesnick) based on a Hilbert pair of wavelets. We particularly address (i) the construction of the dual basis and (ii) the resulting directional analysis. We also revisit the necessary pre-processing stage in the M-band case. While several reconstructions are possible because of the redundancy of the representation, we propose a new optimal signal reconstruction technique, which minimizes potential estimation errors. The effectiveness of the proposed M- band decomposition is demonstrated via denoising comparisons on several image types (natural, texture, seismics), with various M-band wavelets and thresholding strategies. Signicant improvements in terms of both overall noise reduction and direction preservation are observed.
A DSP Implementation of OFDM Acoustic Modem
The success of multicarrier modulation in the form of OFDM in radio channels illuminates a path one could take towards high-rate underwater acoustic communications, and recently there are intensive investigations on underwater OFDM. In this paper, we implement the acoustic OFDM transmitter and receiver design of [4, 5] on a TMS320C6713 DSP board. We analyze the workload and identify the most time-consuming operations. Based on the workload analysis, we tune the algorithms and optimize the code to substantially reduce the synchronization time to 0.2 seconds and the processing time of one OFDM block to 1.7 seconds on a DSP processor at 225 MHz. This experimentation provides guidelines on our future work to reduce the per-block processing time to be less than the block duration of 0.23 seconds for real time operations.
Region based Active Contour Segmentation
In this paper, we propose a natural framework that allows any region-based segmentation energy to be re-formulated in a local way. We consider local rather than global image statistics and evolve a contour based on local information. Localized contours are capable of segmenting objects with heterogeneous feature profiles that would be difficult to capture correctly using a standard global method. The presented technique is versatile enough to be used with any global region-based active contour energy and instill in it the benefits of localization. We describe this framework and demonstrate the localization of three well-known energies in order to illustrate how our framework can be applied to any energy. We then compare each localized energy to its global counterpart to show the improvements that can be achieved. Next, an in-depth study of the behaviors of these energies in response to the degree of localization is given. Finally, we show results on challenging images to illustrate the robust and accurate segmentations that are possible with this new class of active contour models.
A NEW PARALLEL IMPLEMENTATION FOR PARTICLE FILTERS AND ITS APPLICATION TO ADAPTIVE WAVEFORM DESIGN
Sequential Monte Carlo particle filters (PFs) are useful for estimating nonlinear non-Gaussian dynamic system parameters. As these algorithms are recursive, their real-time implementation can be computationally complex. In this paper, we analyze the bottlenecks in existing parallel PF algorithms, and we propose a new approach that integrates parallel PFs with independent Metropolis-Hastings (PPF-IMH) algorithms to improve root mean-squared estimation error performance. We implement the new PPF-IMH algorithm on a Xilinx Virtex-5 field programmable gate array (FPGA) platform. For a onedimensional problem and using 1,000 particles, the PPF-IMH architecture with four processing elements utilizes less than 5% Virtex-5 FPGA resources and takes 5.85 μs for one iteration. The algorithm performance is also demonstrated when designing the waveform for an agile sensing application.
High speed data collection with Blackfin DSP
This report covers a master thesis in embedded systems, the goal of which was to investigate the high speed data collection capabilities with a Blackfin DSP. Basic theory about sampling and noise is covered briefly from a practical point of view. The theory is intended to be useful for those diving into a ADC datasheet for the first time. After an investigation of the delimiting factors, suitable components were selected and a prototype ADC PCB was designed from scratch. The goal is to design a general low noise data collecting unit compatible with the Blackfin DSP. Finally simple DSP software is designed to prove that DSP can handle such a high datastream.Testing the ADC card with the target Blackfin platform indicates thatthe analog parts indeed works. An analog bandwidth of over 10MHz ismeasured at a resolution exceeding 10 bits with respect to noise. The digital parts intended to interleave the two channels digital streams into one Blackfin unit did not work as intended. Only one channel is supported as of now. The report contains suggestions for future work in this area.






