Interesting discovery for the P4
I was just testing my fastest result for the P4, which is something like
2300 - 2500 mflops for double precision, and I discovered that the layout
of the code seemed to be totally irrelevant! I tried 4 different ways of
scheduling the instructions and all versions of the code ran at exactly the same
speed. This seems to indicate that the trace cache of the P4 actually
works quite well. It could also mean that another factor (bandwidth) is
limiting the speed, but still I would expect bigger variations for the
different ways of scheduling.
So the P4 might be quite a good chip after all if code layout does not
have to be heavily optimized.