Fixed point simulation speed: any clever ideas on why my fi objects are running so slow?
Started by ●May 27, 2008

Hi,

I'm transitioning my communications receiver floating-point model (MATLAB) into a fixed-point model, on the way towards a hardware implementation. My fixed-point model, which uses the MATLAB Fixed-Point Toolbox, is running painfully slowly and I'm wondering if the group has _any_ ideas on things I might do to improve the simulation speed?

I have profiled my model, and more than 95% of the sim time is in one calculation: a = b*c, where a, b and c are fi objects. b and c are matrices of size 512x2 and 2x1000 respectively. Furthermore, I have wrapped this into a function and compiled it using the MATLAB mex compiler (emlmex), which gave me a 3x speed improvement. However, the profiler shows that most of the compiled mex file's time is spent calling embedded.fi.fi, which is the fi object constructor. I don't understand why the constructor needs to be called. It feels like I'm doing something wrong because this constructor call cripples the performance - or am I expecting too much from the Fixed-Point Toolbox?

My floating-point sim was getting through about 20 kbits per second of payload data. The emlmex-compiled fixed-point sim runs at 700 bits/s.

I have googled long and hard, and posted several times on the MATLAB newsgroup, but without success so far. Ultimately I may have to write in C, but I'm considering that a last resort at this stage.

Cheers
Andrew
Reply by ●May 27, 2008

Andrew FPGA wrote:
> Hi,
> I'm transitioning my communications receiver floating point model
> (matlab) into a fixed point model, on the way towards a hardware
> implementation. My fixed point model using the matlab fixed point
> toolbox is running painfully slowly and I'm wondering if the group has
> _any_ ideas on things I might do to improve the simulation speed?
>
> I have profiled my model, and more than 95% of the sim time is in one
> calculation. a = b*c where a,b,c are fi objects. b and c are matrices
> of size 512x2 and 2x1000 respectively. Furthermore, I have wrapped
> this into a function and compiled it using the matlab mex compiler
> (emlmex) which gave me a 3x speed improvement. However, the profiler
> shows that most of the compiled mex file's time is spent calling
> embedded.fi.fi which is the fi object constructor. I don't understand
> why the constructor needs to be called? It feels like I'm doing
> something wrong because this constructor calling cripples the
> performance - or am I expecting too much from the fixed point toolbox?
>
> My floating point sim was getting through about 20kbits per second of
> payload data. The emlmex compiled fixed point sim runs at 700 bits/s.
>
> I have googled long and hard, and posted several times on the matlab
> newsgroup but without success so far. Ultimately I may have to write
> in C, but I'm considering that a last resort at this stage.
>
> Cheers
> Andrew

At a guess, it's doing the computation in floating point and then saving the results away as fixed point.

I use Scilab instead of Matlab, but they're similar programs so I would expect some common behavior. In Scilab, if you're stepping through n to do a(n) = b * c, and you haven't created a beforehand, the program has to constantly create bigger and bigger versions of the array, which slows things down _considerably_. You didn't say if you were keeping a as a scalar or if you're saving it away as a vector, but this may be the problem.

Ditto if you (or Matlab) are creating temporary storage, even if it's implicit in the middle of a calculation (such as if you take c'). You may want to look through the code, asking yourself "if I were an interpreter, where would I feel compelled to create fi objects?". You may find your answer that way. If you're lucky, you can then rearrange your simulation to not make that necessary, and speed things up thereby.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" gives you just what it says.
See details at http://www.wescottdesign.com/actfes/actfes.html
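A minimal sketch of the array-growth point above, written in MATLAB rather than Scilab. The loop length N and the values stored are made up for illustration and are not from the original model; the sketch only contrasts growing a vector inside a loop with allocating it once up front.

```matlab
% Sketch only: N and the stored values are illustrative assumptions.
N = 20000;

% Growing the array one element at a time: the interpreter may have to
% reallocate and copy as a_grow gets longer.
tic
a_grow = [];
for n = 1:N
    a_grow(n) = 2 * n;   %#ok<SAGROW>
end
t_grow = toc;

% Allocating once, then filling in place.
tic
a_pre = zeros(1, N);
for n = 1:N
    a_pre(n) = 2 * n;
end
t_pre = toc;

fprintf('grow: %.4f s, preallocate: %.4f s\n', t_grow, t_pre);
assert(isequal(a_grow, a_pre));   % both loops build the same vector
```

How large the gap is depends on the MATLAB version and on N, so the printed times are worth checking on your own machine rather than taking on faith.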
Reply by ●May 27, 2008

Andrew FPGA wrote:
> Hi,
> I'm transitioning my communications receiver floating point model
> (matlab) into a fixed point model, on the way towards a hardware
> implementation. My fixed point model using the matlab fixed point
> toolbox is running painfully slowly and I'm wondering if the group has
> _any_ ideas on things I might do to improve the simulation speed?
>
> I have profiled my model, and more than 95% of the sim time is in one
> calculation. a = b*c where a,b,c are fi objects. b and c are matrices
> of size 512x2 and 2x1000 respectively. Furthermore, I have wrapped
> this into a function and compiled it using the matlab mex compiler
> (emlmex) which gave me a 3x speed improvement. However, the profiler
> shows that most of the compiled mex file's time is spent calling
> embedded.fi.fi which is the fi object constructor. I don't understand
> why the constructor needs to be called? It feels like I'm doing
> something wrong because this constructor calling cripples the
> performance - or am I expecting too much from the fixed point toolbox?
>
> My floating point sim was getting through about 20kbits per second of
> payload data. The emlmex compiled fixed point sim runs at 700 bits/s.
>
> I have googled long and hard, and posted several times on the matlab
> newsgroup but without success so far. Ultimately I may have to write
> in C, but I'm considering that a last resort at this stage.

The proper use of Matlab and its ilk is verifying algorithms, not simulating hardware. To execute a known sequence of computer ops, use assembler. To execute an approximately known sequence of computer ops, use a compiler whose foibles you understand. When you run a program that constructs and executes a model, don't expect time to be among the modeled parameters.

Jerry
--
Engineering is the art of making what you want from things you can get.
Reply by ●May 27, 2008

Thanks for your comments guys.

@Tim - My calculation was literally a = b*c, so a is a 2D matrix also. I was also preallocating a before the calc. I finally got a post on the MATLAB newsgroup that explained that if I return an fi object from my compiled mex function (I am), then the fi constructor call is inevitable, even if a is preallocated. Since the majority of the total run time is in this one function, there is little that can be done. The conclusion I draw from this is that one should expect a large slowdown when using MATLAB fi objects - if MathWorks could rewrite parts of the fi toolbox in C (e.g. the constructor), or at least use a style that allows it to be compiled by emlmex, then its performance would be a lot better.

@Jerry -
>> When you run a program that constructs and executes a model, don't
>> expect time to be among the modeled parameters.

Fair point, I guess I just wasn't expecting such a large slowdown going from floating point to fixed point.
Reply by ●May 27, 2008

Andrew FPGA <andrew.newsgroup@gmail.com> writes:
> Thanks for your comments guys.
>
> @Tim - My calculation was literally a = b*c so a is a 2d matrix also.
> I was also preallocating a before the calc. I finally got a post on
> the matlab newsgroup that explained if returning an fi object from my
> compiled mex function (I am), then the fi constructor call is
> inevitable, even if a is preallocated. Since the majority of the
> total run time is in this one function there is little that can be
> done. The conclusion I draw from this is that one should expect a
> large slowdown if using matlab fi objects - if mathworks could rewrite
> parts of the fi toolbox in C (e.g. the constructor), or at least use a
> style that allows it to be compiled by emlmex, then its performance
> would be a lot better.
>
> @Jerry -
>>> When you run a program that constructs and executes a model, don't
>>> expect time to be among the modeled parameters.
> Fair point, I guess I just wasn't expecting such a large slow down
> going from floating point to fixed point.

Hi Andrew,

I've seen a simulation that took days to run on Matlab execute in seconds when rewritten in simple C. My current methodology is to use Matlab for "exploratory" simulation, e.g., of a single, simple block just to verify an approach or to experiment. When it comes time for "formal" system simulations, I use C.

But I answer a question you didn't ask...
--
%  Randy Yates                  % "With time with what you've learned,
%% Fuquay-Varina, NC            %  they'll kiss the ground you walk
%%% 919-577-9882                %  upon."
%%%% <yates@ieee.org>           % '21st Century Man', *Time*, ELO
http://www.digitalsignallabs.com
Reply by ●May 28, 2008

> But I answer a question you didn't ask..

Thanks for the comments Randy. I'm new at this, so that question was in the back of my mind. I was really starting to question my whole approach.

One thing I have noticed is that if the MATLAB code is not vectorized it runs very slowly, i.e. looping is to be avoided at all costs. In the very beginning, when I was speeding up my MATLAB floating-point model, I found I could get very large performance improvements relatively easily, simply by vectorizing stuff. I sped up my model more than 100x. Maybe this is a special case since my code was so bad (lots of for loops) to begin with. Anyhow, presumably that is well known to the folk on this newsgroup.

Below is the response I got from a MathWorks employee, and it gave me quite a bit more insight. He gets only a 10x performance hit between the floating-point BLAS lib multiply calls and an emlmex-compiled version using fi fixed-point objects, which seems far more reasonable. It feels right. I get a worse slowdown than that, so I'm going to fiddle a little more to see if I can get his performance - for example, he creates the fi objects inside the emlmex-compiled function, whereas I was passing them in.

Hi Andrew,

You are right that fixed-point simulations are slower in MATLAB. You have also found the best solution: emlmex.

I tried your code. I wrote this function to try emlmex:

    function c_fi = fixProd(N) %#eml
    Fy = fimath('RoundMode','Floor','OverflowMode','Wrap',...
        'ProductMode','KeepLSB','ProductWordLength',32,...
        'SumMode','KeepLSB','SumWordLength',32);
    Ty = numerictype(1,16,8);
    a_fi = fi(rand(100),'numerictype',Ty,'fimath',Fy);
    b_fi = fi(rand(100),'numerictype',Ty,'fimath',Fy);
    c_fi = a_fi*b_fi;

If I run this code in MATLAB, I get (on the 2nd run):

    tic, a=fixProd(100); toc
    Elapsed time is 1.557883 seconds.

Then I do this:

    emlmex fixProd -eg {100} -o emlmex_fixprod
    tic, a=emlmex_fixprod(100); toc
    tic, a=emlmex_fixprod(100); toc % run twice to eliminate DLL load overhead
    Elapsed time is 0.010871 seconds.

That's two orders of magnitude improvement from emlmex vs. running the code in MATLAB. However, my floating-point function still runs much faster:

    function c = floatProd(N)
    a = rand(N);
    b = rand(N);
    c = a*b;

    tic, a=floatProd(100); toc
    Elapsed time is 0.001297 seconds.

So MATLAB floating point is still about 10 times faster than my emlmex-generated function. Why? That's because of what's happening under the hood.

In floating-point mode, for a linear algebra operation like matrix multiply, MATLAB uses the BLAS libraries, which are processor optimized, i.e., a matrix multiply in MATLAB is blazingly fast (faster than a good C implementation). The same holds for other operations like svd, eig, fft (MATLAB uses fftw), etc. You really can't beat MATLAB's speed on these operations by writing C code (at least not without a lot of effort).

For fixed-point operations, MATLAB does not use the BLAS; it emulates a fixed-point processor and keeps track of rounding/overflow modes, scaling, etc. This makes fixed point slow in MATLAB. When you "emlmex" these fixed-point functions, MATLAB generates a generic C implementation of these functions (i.e. not using processor-optimized libraries), and compiles those. So although this code uses generic integer datatypes in C, and will be much faster than running a fixed-point function directly in MATLAB, it really can't compete with the BLAS.

So moral of the story:
1. What you're observing is correct.
2. A matrix multiply is not the best example for emlmex,
3. but in general it is the way to speed up MATLAB fixed-point simulations.

If you have Parallel Computing Toolbox, PARFOR could also be an option, but that's a whole different story (again, breaking up a matrix multiply operation wouldn't be the best candidate for using PARFOR; it's typically more efficient to just do an operation like that on a single core). PARFOR could be more useful in cases where you naturally have a FOR loop in your code (for instance, if you're looping over different SNR values for your system simulation).

Hope this helps,
Idin

-
Idin Motedayen-Aval
The MathWorks, Inc.
zq=[4 2 5 -15 -1 -3 24 -57 45 -12 19 -12 15 -8 3 -7 8 -69 53 12 -2];
char(filter(1,[1,-1],[105 zq])), clear zq
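The "loop over SNR values" case mentioned above might look roughly like the sketch below. This is only an illustration: the loop body is a toy BPSK-over-AWGN bit-error count standing in for the real receiver, and snr_dB, nbits and ber are made-up names, not anything from the original model. It assumes Parallel Computing Toolbox is available; replacing parfor with for gives the same structure without it.

```matlab
% Sketch only: a toy BPSK/AWGN error count stands in for the real receiver.
snr_dB = 0:2:8;                 % Eb/N0 points to simulate (assumed values)
nbits  = 1e5;                   % bits per SNR point (assumed value)
ber    = zeros(size(snr_dB));   % one result slot per iteration

parfor k = 1:numel(snr_dB)
    bits  = rand(1, nbits) > 0.5;                  % random payload bits
    tx    = 2 * bits - 1;                          % map 0/1 to -1/+1
    sigma = sqrt(1 / (2 * 10^(snr_dB(k) / 10)));   % noise std dev for this Eb/N0
    rx    = tx + sigma * randn(1, nbits);          % AWGN channel
    ber(k) = mean((rx > 0) ~= bits);               % hard decision, error rate
end

disp([snr_dB; ber]);
```

Each iteration is independent and writes only its own ber(k), which is what makes a loop like this a natural fit for parfor, unlike splitting up a single matrix multiply.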
Reply by ●May 28, 2008

On May 27, 4:23 am, Andrew FPGA <andrew.newsgr...@gmail.com> wrote:
> Hi,
> I'm transitioning my communications receiver floating point model
> (matlab) into a fixed point model, on the way towards a hardware
> implementation. My fixed point model using the matlab fixed point
> toolbox is running painfully slowly and I'm wondering if the group has
> _any_ ideas on things I might do to improve the simulation speed?
>

You need to rewrite the model using integer data types. The toolbox's "fixed point" is simulated and is capable of representing very large integers (64K bits I think), and thus has lots of overhead, but is nice if you need that capability.

How big/complex is your model? What hardware are you going to run on?
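One way to read the "integer data types" suggestion is to keep the values as scaled integers and do the rounding and overflow handling yourself. The sketch below is an assumption-laden illustration, not the poster's method: it stores the integers in ordinary doubles (MATLAB's native int16/int32 arithmetic saturates on overflow rather than wrapping), and the word lengths are chosen to resemble a signed 16-bit input with 8 fractional bits, Floor rounding and Wrap overflow. It is not claimed to be bit-exact with the fi toolbox.

```matlab
% Sketch only: scaled integers held in doubles; word lengths are assumptions.
frac_bits = 8;     % fractional bits of the inputs
in_bits   = 16;    % signed input word length
out_bits  = 32;    % signed product/sum word length

% Wrap an integer-valued double into a signed two's-complement range.
wrap = @(q, nbits) mod(q + 2^(nbits-1), 2^nbits) - 2^(nbits-1);

% Random stand-ins for the 512x2 and 2x1000 matrices in the question.
b = rand(512, 2) - 0.5;
c = rand(2, 1000) - 0.5;

% Quantize: scale, round towards minus infinity, wrap to the input width.
b_int = wrap(floor(b * 2^frac_bits), in_bits);
c_int = wrap(floor(c * 2^frac_bits), in_bits);

% Plain double matrix multiply. Each 16x16-bit product fits in 31 bits and
% only two products are summed per output, so every intermediate is far
% below 2^53 and therefore exact in double precision.
a_int = wrap(b_int * c_int, out_bits);

% The result carries 2*frac_bits fractional bits.
a = a_int / 2^(2 * frac_bits);

% Quantization error relative to the unquantized product.
max_err = max(max(abs(a - b * c)));
fprintf('max abs error vs. floating point: %g\n', max_err);
```

Because the multiply itself is an ordinary double a = b*c, it goes through the same fast path as the floating-point model; the exactness argument only holds while the inner dimension and word lengths keep the sums below 2^53.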
Reply by ●May 28, 2008

Andrew FPGA <andrew.newsgroup@gmail.com> writes:
> [...]
> One thing I have noticed is that if the matlab code is not vectorized
> it runs very slowly. e.g. looping is to be avoided at all costs.

Yes! I have encountered the exact same thing and come to the exact same conclusion!

Thanks for the other info on BLAS library optimizations.
--
%  Randy Yates                  % "Midnight, on the water...
%% Fuquay-Varina, NC            %  I saw... the ocean's daughter."
%%% 919-577-9882                % 'Can't Get It Out Of My Head'
%%%% <yates@ieee.org>           % *El Dorado*, Electric Light Orchestra
http://www.digitalsignallabs.com
Reply by ●May 28, 2008

On May 27, 4:56 pm, Jerry Avins <j...@ieee.org> wrote:
> ...
> The proper use of Matlab and its ilk is verifying algorithms, not
> simulating hardware.

what's "proper" is whatever you can do with it. you can simulate hardware in MATLAB just as you can with C or some other language. it might run very slow, but one can implement rounding or quantization at every mathematical operation. also clipping or wrapping (whatever it is that your hardware does).

> To execute a known sequence of computer ops, use
> assembler. To execute a approximately known sequence of computer ops,
> use a compiler whose foibles you understand.

you should always understand the foibles of the compiler (or whatever tool) we use. but we never get them all.

> When you run a program that constructs and executes a model, don't
> expect time to be among the modeled parameters.

that's fer sure.

r b-j
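A small sketch of what "rounding or quantization at every operation, plus clipping or wrapping" can look like in plain MATLAB, without the fi toolbox. The word length, fractional bits and test values are assumptions picked for illustration, and the helper names (to_int, saturate, wrap) are hypothetical.

```matlab
% Sketch only: 16-bit signed word with 8 fractional bits (assumed format).
frac_bits = 8;
word_bits = 16;
lo = -2^(word_bits - 1);        % most negative integer code
hi =  2^(word_bits - 1) - 1;    % most positive integer code

to_int   = @(x) floor(x * 2^frac_bits);          % quantize, round down
saturate = @(q) min(max(q, lo), hi);             % clip to the word range
wrap     = @(q) mod(q - lo, 2^word_bits) + lo;   % two's-complement wrap

x = [-200, -1.003, 0.5, 127.99, 130];   % two values overflow the format
q = to_int(x);

x_sat  = saturate(q) / 2^frac_bits;   % -200 and 130 clip to the range ends
x_wrap = wrap(q) / 2^frac_bits;       % -200 wraps to 56, 130 wraps to -126

disp([x; x_sat; x_wrap]);
```

Applying one of these after every multiply and add is what makes such a model slow in an interpreter, but it lets the simulation match whichever overflow behaviour the target hardware actually has.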
Reply by ●May 28, 2008

> the fixed point stuff is simulated "fixed point" capable of representing very
> large integers (64K bits I think), and thus has lots of overhead

I'm using emlmex to compile the m code, so the fixed-point data types are static. E.g. I have them set to 32 bits right now. Yes, I agree that up to 64 kbits per word is supported, but I can't imagine MATLAB actually using 64 kbits when I specify only 32?

> How big/complex is your model?

It's a receiver for continuous phase modulation (CPM), so the ML receiver has a bank of matched filters followed by a Viterbi decoder. (I'm not tackling the synchronisation problem.) In my case I have 256 matched filters and a Viterbi trellis with 128 states. The matched filters can pretty much be modelled with a single matrix multiply. This is the slowest part of the fixed-point model, but it's the fastest part of the floating-point model (presumably because of the direct use of machine-optimised BLAS lib calls). The Viterbi decoder iterates symbol by symbol, so it has a loop that I could not vectorize - hence it is the bottleneck in the floating-point model. It will be interesting to see what happens when I go fixed point on the Viterbi decoder.

> What hardware are going to run on?

Low-cost Xilinx FPGA. Symbol rates are on the order of tens of megasymbols per second, so an FPGA is a requirement. I worked out that the matched filters alone need about 80G mults/sec.






