DSPRelated.com
Forums

OT: ASCII files and Intel P4 MMX instructions

Started by Rune Allnor November 4, 2006
Hi all.

A couple of weeks ago I wrote this small test program that took an ASCII
file as input, did some processing and wrote the results back to an ASCII
file. The input file contained some 4e6 (2^22) double-precision numbers.
The file format is clean, one number per line, no fuss. When I first wrote
the program, the time to load the data was some 70 seconds. The program
wrote a similar number of processed data back to file in some 30 seconds.

Then I got the new Turbo C++ compiler. I switched on P4 MMX instruction
support and ran the same program again.

Now the program reads 4e6 data points in 12 seconds and writes them in 10.
While I have played a bit with the settings in the compiler, I can't think
of anything else I have done (and left active) that might explain this
sort of speed-up.

I don't know too much about what MMX does, other than that it seems as if
those instructions act on 8-bit words in parallel. Is it reasonable to
assume that the MMX instructions are responsible for speeding up this
program, or may there be some other explanation? If so, what should I
look for?

Rune

"Rune Allnor" wrote:
> [MMX instructions speed up floating point calculations]

It's not quite clear to me how you ran the program, but I'd expect the
program to take x seconds the first time it runs, and only 2/3 of that or
even less on the second and subsequent runs. The (HW) hard disk cache and
the (SW) OS disk cache keep parts of the file in memory, so file access
is much faster. I don't know if you took this into account, so I just
wanted to point it out to you.

OTOH,
http://webster.cs.ucr.edu/AoA/Windows/HTML/TheMMXInstructionSet.html and
http://www.tommesani.com/MMXPrimer.html seem to indicate (*) that MMX
might indeed be the "culprit", although:

- I don't know how efficiently Turbo C++ translates your C code into
specialized MMX code; the performance gains alluded to in those two texts
seem to come from specialized assembly code.

- This would probably (and I am not an expert here) not explain the
speed-up you described for ***reading*** those 4 million data points, as
this code is I/O bound.

Another thing is that just switching to a newer (???) compiler might show
such beneficial effects.

Sorry for not being more helpful.

Martin

(*) These were just the first two hits from gooooogling, dunno if they
are any good. They looked appropriate, though :-)
Martin Blume wrote:
> "Rune Allnor" wrote:
> > [MMX instructions speed up floating point calculations]
Nope, the computation time stayed the same; the big difference was file I/O.
> It's not quite clear to me how you ran the program,
This was a command-line stand-alone program that loaded the file, did the
computations and wrote the results to another file. I used the C clock()
function to time the loading, computations and writing.
> but I'd expect x seconds the program runs for the first time,
> but only 2/3 or even less on the second and subsequent runs.
> The (HW) hard disk cache and (SW) OS disk caches keep parts
> of the file in memory, so the file access is much faster.
Maybe so, but the file is some 50 MB in size, slightly bigger than the
file cache. The times are consistent between runs; the big difference
occurred after I compiled the program with the new compiler.

Rune
"Rune Allnor" <allnor@tele.ntnu.no> wrote in message 
news:1162654315.059511.315030@e3g2000cwe.googlegroups.com...

> ... the time to load the data was some 70 seconds. The program
> wrote a similar number of processed data back to file in
> some 30 seconds.
>
> Then I got the new Turbo C++ compiler. I switched on P4
> MMX instruction support, and ran the same program again.
>
> Now the program reads 4e6 data points in 12 seconds
> and writes them in 10. While I have played a bit with
> the settings in the compiler, I can't think of
> anything else I have done (and left active) that might
> explain this sort of speed-up
The answer will depend a lot on what your program is doing between "read
4e6 data points" and "write processed data points".

If the answer is "essentially nothing", then one possible explanation is
that the old runtime libraries (the libraries that implement stuff like
open(), read(), write(), fprintf(), etc.) are just horribly inefficient
(or were perhaps compiled and distributed without optimization and
targeted for a 386 instruction set to maximize portability).

On the other hand, if you are doing stuff like convolution or correlation,
it's my understanding that MMX would be directly useful and, if the new
compiler is smart enough to figure out what you are doing, you probably
have your answer.

One caveat: I have heard many times that use of MMX is unsupportable in a
multi-tasking (or multi-threaded) environment. I have also heard that MMX
was ignored by Microsoft compilers for this reason. If someone knows
something to the contrary, I would appreciate being set straight.
On 2006-11-04 11:31:55 -0400, "Rune Allnor" <allnor@tele.ntnu.no> said:

> Hi all.
>
> A couple of weeks ago I wrote this small test program that took an
> ASCII file as input, did some processing and wrote the results back
> to an ASCII file. The input file contained some 4e6 numbers (2^22)
> on double precision. The file format is clean, one number per line,
> no fuzz. When I first wrote the program the time to load the data
> was some 70 seconds. The program wrote a similar number of processed
> data back to file in some 30 seconds.
>
> Then I got the new Turbo C++ compiler. I switched on P4 MMX
> instruction support, and ran the same program again.
>
> Now the program reads 4e6 data points in 12 seconds and writes them
> in 10. While I have played a bit with the settings in the compiler,
> I can't think of anything else I have done (and left active) that
> might explain this sort of speed-up
>
> I don't know too much about what MMX does, other than it seems as
> if those instructions act on 8-bit words in parallel. Is it
> reasonable to assume that the MMX instructions are responsible for
> speeding up this program, or may there be some other explanation?
> If so, what should I look for?
>
> Rune
The execution times of programs which do formatted input and output are
often dominated by the conversion from and to the external character
format. There are many stories of folks who carefully "optimize" their
programs and "improve" the output by adding one additional line, with the
result that the program runs slower.

So time the program with the computation stubbed out, so all it does is
input and output. The speed-up from removing the computation will likely
be small. Now try the same with the old compiler. A good bet is that the
new compiler has better code in its run-time support for the input and
output conversion. This means that you have probably been pushing on a
string, as the changes have nothing to do with you and much to do with
the change in compiler version.

This high expense of formatted input/output has little to do with the
language involved. Some languages have been tagged as slow in I/O because
they have more users doing formatted output. There are also some large
processing packages reputed to be fast, as they are usually applied to
unformatted I/O, which surprise their users by being rather slow when the
less common heavily formatted I/O is the major use. Doing guaranteed
last-bit-quality conversions is hard and subtle, so all this is not quite
as surprising as it might be.

There are many other reasons for speed changes, like alignment issues,
cache hits, yada yada, but I/O is a good starter.

Rune Allnor wrote:

> Then I got the new Turbo C++ compiler. I switched on P4 MMX instruction
> support, and ran the same program again.
The C++ compiler from Intel (ICC) is known to do the best optimization
for x86. It even tries to make use of the SSE instructions.
> Now the program reads 4e6 data points in 12 seconds and writes them in
> 10. While I have played a bit with the settings in the compiler, I
> can't think of anything else I have done (and left active) that might
> explain this sort of speed-up
Borland was never known for good optimization, but perhaps the new
version does a better job. However, this may be a random result due to
file caching, disk fragmentation, some system activity in the background
or other irrelevant issues.
> I don't know too much about what MMX does, other than it seems as
> if those instructions act on 8-bit words in parallel. Is it reasonable
> to assume that the MMX instructions are responsible for speeding up
> this program, or may there be some other explanation? If so, what
> should I look for?
The MMX instructions can't help much unless you use the MMX library
functions explicitly.
>
"All evil in the world is because of the premature optimization" (c) Knuth

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

John E. Hadstate wrote:


> One caveat: I have heard many times that use of MMX is
> unsupportable in a multi-tasking (or multi-threaded)
> environment. I have also heard that MMX was ignored by
> Microsoft compilers for this reason.
1. MMX is almost useless unless you use it explicitly. The compiler can't
make efficient use of MMX just by itself.

2. You can use the FPU either for MMX data or for floating-point data.
You can't intermix MMX with float. This can be a big inconvenience.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
"Rune Allnor" wrote:
>>> [MMX instructions speed up floating point calculations]
>
> Nope, the computation time stayed the same, the big
> difference was file IO.
I'd venture to say that file I/O has almost nothing to do with MMX.
> This was a command-line stand-alone that loaded the file,
> did the computations and wrote the results to another file.
> I used the C++ CLOCK command to time the loading,
> computations and writings.
Loading and writing were faster? What about the computations?
> Maybe so, but the file is some 50 MB in size. Slightly
> bigger than the file cache. The times are consistent between
> runs, the big difference occurred after I compiled the program
> with the new compiler.
And what was the old compiler?

Martin
Rune Allnor wrote:

(snip)

> The file format is clean, one number per line, no fuzz. When I first
> wrote the program the time to load the data was some 70 seconds. The
> program wrote a similar number of processed data back to file in some
> 30 seconds.
> Then I got the new Turbo C++ compiler. I switched on P4 MMX instruction
> support, and ran the same program again.
Presumably the new compiler came with its own library, which could have a
completely different implementation of the conversion routines. It might
be that they use MMX, but it might just be differences in the coding of
the I/O library routines.

-- glen
Martin Blume wrote:
> "Rune Allnor" wrote:
> >>> [MMX instructions speed up floating point calculations]
> >
> > Nope, the computation time stayed the same, the big
> > difference was file IO.
>
> I'd venture to say that file I/O has almost nothing to do
> with MMX.
Well, if you can work on 4 chars in parallel, as seems to be the case
with the MMX instruction set, it might have an effect on ASCII -> float
conversions.
> > This was a command-line stand-alone that loaded the file,
> > did the computations and wrote the results to another file.
> > I used the C++ CLOCK command to time the loading,
> > computations and writings.
>
> loading and writing was faster?
> What about computations?
The computation time stayed the same.
> > Maybe so, but the file is some 50 MB in size. Slightly
> > bigger than the file cache. The times are consistent between
> > runs, the big difference occurred after I compiled the program
> > with the new compiler.
>
> And what was the old compiler?
The previous Borland compiler (2002 or so vintage).

Rune