
Re: efficient summing of vector.

To: Camm Maguire <camm@enhanced.com>
Subject: Re: efficient summing of vector.
From: Peter Soendergaard <soender@cs.utk.edu>
Date: Mon, 12 Mar 2001 10:34:30 -0500 (EST)
cc: R Clint Whaley <rwhaley@cs.utk.edu>, atlas-comm@cs.utk.edu
In-Reply-To: <54y9ubxb3l.fsf@intech19.enhanced.com>


Hi Camm.

I must admit that I am already using your 12-instruction sequence. I could
not read your macro code, so I used objdump --disassemble to read your code
instead :-)

I was just looking for something efficient for the complex situation,
where the result cannot be written to contiguous memory.
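
For reference, a plain baseline for the non-contiguous case might look
like the sketch below. It assumes the four results land on every other
float (c[0], c[2], c[4], c[6]), as with interleaved real/imaginary
storage; that layout, the name store_stride2 and the use of the
<xmmintrin.h> intrinsics are assumptions for illustration. It is not
claimed to be efficient, only to show what has to happen.

```c
#include <xmmintrin.h>

/* Hypothetical helper: scatter the four lanes of r to c[0], c[2],
   c[4], c[6], leaving the odd-indexed floats untouched. */
void store_stride2(float *c, __m128 r)
{
    /* Lane 0 can be stored directly with a scalar store. */
    _mm_store_ss(c + 0, r);
    /* Each remaining lane is first broadcast into lane 0, then stored. */
    _mm_store_ss(c + 2, _mm_shuffle_ps(r, r, _MM_SHUFFLE(1, 1, 1, 1)));
    _mm_store_ss(c + 4, _mm_shuffle_ps(r, r, _MM_SHUFFLE(2, 2, 2, 2)));
    _mm_store_ss(c + 6, _mm_shuffle_ps(r, r, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

That is three shuffles and four scalar stores per result vector, which
is the cost a better scheme would have to beat.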

Cheers and thanks,

Peter


On 12 Mar 2001, Camm Maguire wrote:

> Hi Peter!  
> 
> Peter Soendergaard <soender@cs.utk.edu> writes:
> 
> > Hi Camm, master in the ways of intel-assembly.
> > 
> 
> *please*, not true at all!  I'm fishing around in the dark like
>  everyone else!
> 
> > You once wrote that you shaved an instruction off the way I sum an SSE
> > register. I use a sequence like this to sum the register in #reg using
> 
> This, if memory serves, was not in the vector sum at the end of the
> k-loop, but in the main block looping over the 4 columns of A doing
> the add-multiply.  There is also a way to make the "C write" step more
> efficient, I think, but that's not what I was referring to above.  The
> best strategy for the latter that I've thought of so far seems to lie
> in combining the fragments of the various C results (where possible)
> into the same registers, doubling the effective workload of a given
> movhlps, and winding up with the final 4 (single precision) C answers,
> (when the problem specifies that they will be contiguous), in a single
> register, and written out to memory in a single step.
> 
> This for example is my windup step for SREAL:
> 
> #define z f(t0,0,cx) pc(4,0) pul(5,4) pc(6,1) puh(5,0) pul(7,6)  \
>           pa(0,4) puh(7,1) pc(4,2) pa(1,6) ps(68,6,4) ps(238,6,2) pa(4,2) pu(2,0,cx)
> 
> i.e.
> 	"movaps %%xmm4,%%xmm0\n\t"
> 	"unpcklps %%xmm5,%%xmm4\n\t"
> 	"movaps %%xmm6,%%xmm1\n\t"
> 	"unpckhps %%xmm5,%%xmm0\n\t"
> 	"unpcklps %%xmm7,%%xmm6\n\t"
> 	"addps %%xmm0,%%xmm4\n\t"
> 	"unpckhps %%xmm7,%%xmm1\n\t"
> 	"movaps %%xmm4,%%xmm2\n\t"
> 	"addps %%xmm1,%%xmm6\n\t"
> 	"shufps $68,%%xmm6,%%xmm4\n\t"
> 	"shufps $238,%%xmm6,%%xmm2\n\t"
> 	"addps %%xmm4,%%xmm2\n\t"
> 	"movups %%xmm2,(%%ecx)\n\t"
> 
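
For readers who find the raw listing hard to follow, the same data flow
can be sketched with the SSE intrinsics from <xmmintrin.h>. This is only
an illustration, not the ATLAS kernel itself: the function name windup4
and its argument names are made up here, and a, b, c, d stand in for the
accumulators held in xmm4..xmm7 in the listing above. The compiler, not
this code, decides the final register allocation and instruction count.

```c
#include <xmmintrin.h>

/* Hypothetical helper: reduce four 4-float accumulators to
   [sum(a), sum(b), sum(c), sum(d)] and write them with one store. */
void windup4(float *out, __m128 a, __m128 b, __m128 c, __m128 d)
{
    /* Interleave pairs of accumulators (unpcklps / unpckhps). */
    __m128 ab_lo = _mm_unpacklo_ps(a, b);  /* [a0, b0, a1, b1] */
    __m128 ab_hi = _mm_unpackhi_ps(a, b);  /* [a2, b2, a3, b3] */
    __m128 cd_lo = _mm_unpacklo_ps(c, d);  /* [c0, d0, c1, d1] */
    __m128 cd_hi = _mm_unpackhi_ps(c, d);  /* [c2, d2, c3, d3] */

    /* One addps now does half of the reduction for two results at once. */
    __m128 ab = _mm_add_ps(ab_lo, ab_hi);  /* [a0+a2, b0+b2, a1+a3, b1+b3] */
    __m128 cd = _mm_add_ps(cd_lo, cd_hi);  /* [c0+c2, d0+d2, c1+c3, d1+d3] */

    /* Same immediates as in the listing: 68 picks lanes 0,1 of each
       source, 238 picks lanes 2,3 of each source. */
    __m128 even = _mm_shuffle_ps(ab, cd, 68);   /* [a0+a2, b0+b2, c0+c2, d0+d2] */
    __m128 odd  = _mm_shuffle_ps(ab, cd, 238);  /* [a1+a3, b1+b3, c1+c3, d1+d3] */

    /* Final add, then a single unaligned store (movups). */
    _mm_storeu_ps(out, _mm_add_ps(even, odd));
}
```

Calling windup4(out, a, b, c, d) leaves the four sums in out[0..3], so
it only applies when the four C entries are contiguous in memory.
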
> Sorry this is so rushed and unclear.  Of course, this needs to be
> changed somewhat for the other cases.
> 
> Take care,
> 
> > xmm7 as scratch, and it seems like a clumsy way to do it. How can it be
> > done in 4 instructions?
> > 
> >         __asm__ __volatile__ ("movhlps " #reg ", %%xmm7\n"\
> >                               "addps " #reg ", %%xmm7\n"\
> >                               "movaps %%xmm7, " #reg "\n"\
> >                               "shufps $1, " #reg ", %%xmm7\n"\
> >                               "addss %%xmm7, " #reg "\n"\
> > 
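
The five steps quoted above map onto intrinsics roughly as follows. A
sketch only, assuming <xmmintrin.h> is available; hsum_ps is a
hypothetical name, and the scratch value that lives in xmm7 in the asm
is an ordinary local here. It illustrates the existing sequence and is
not claimed to be the shorter variant asked about.

```c
#include <xmmintrin.h>

/* Hypothetical helper: return v0 + v1 + v2 + v3 for v = [v0, v1, v2, v3]. */
float hsum_ps(__m128 v)
{
    /* movhlps: bring the high pair down, t = [v2, v3, v2, v3]. */
    __m128 t = _mm_movehl_ps(v, v);
    /* addps: lanes 0 and 1 now hold v0+v2 and v1+v3. */
    t = _mm_add_ps(t, v);
    /* shufps $1: move lane 1 (v1+v3) into lane 0 of a second value. */
    __m128 u = _mm_shuffle_ps(t, t, 1);
    /* addss: add the two partial sums in lane 0 and extract the scalar. */
    return _mm_cvtss_f32(_mm_add_ss(t, u));
}
```

Only lane 0 of the result is meaningful; the upper lanes carry leftover
partial sums, just as they do in the asm version.
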
> > 
> > Hope you can help me,
> > 
> > Cheers,
> > Peter
> > 
> > 
> > 
> 
> -- 
> Camm Maguire			     			camm@enhanced.com
> ==========================================================================
> "The earth is but one country, and mankind its citizens."  --  Baha'u'llah
>
