# Re: efficient summing of vector.

```
Hi Camm.

I was just looking for something efficient for the complex situation,
where the result cannot be written to contiguous memory.

Cheers and thanks,

Peter

On 12 Mar 2001, Camm Maguire wrote:

> Hi Peter!
>
> Peter Soendergaard <soender@cs.utk.edu> writes:
>
> > Hi Camm, master in the ways of intel-assembly.
> >
>
> *please*, not true at all!  I'm fishing around in the dark like
>  everyone else!
>
> > You once wrote that you shaved an instruction off the way I sum an SSE
> > register. I use a sequence like this to sum the register in #reg using
>
> This, if memory serves, was not in the vector sum at the end of the
> k-loop, but in the main block looping over the 4 columns of A doing
> the add-multiply.  There is also a way to make the "C write" step more
> efficient, I think, but that's not what I was referring to above.  The
> best strategy for the latter that I've thought of so far seems to lie
> in combining the fragments of the various C results (where possible)
> into the same registers, doubling the effective workload of a given
> movhlps, and winding up with the final 4 (single precision) C answers,
> (when the problem specifies that they will be contiguous), in a single
> register, and written out to memory in a single step.
>
> This for example is my windup step for SREAL:
>
> #define z f(t0,0,cx) pc(4,0) pul(5,4) pc(6,1) puh(5,0) pul(7,6)  \
>           pa(0,4) puh(7,1) pc(4,2) pa(1,6) ps(68,6,4) ps(238,6,2) pa(4,2) pu(2,0,cx)
>
> i.e.
> 	"movaps %%xmm4,%%xmm0\n\t"
> 	"unpcklps %%xmm5,%%xmm4\n\t"
> 	"movaps %%xmm6,%%xmm1\n\t"
> 	"unpckhps %%xmm5,%%xmm0\n\t"
> 	"unpcklps %%xmm7,%%xmm6\n\t"
> 	"addps %%xmm0,%%xmm4\n\t"
> 	"unpckhps %%xmm7,%%xmm1\n\t"
> 	"movaps %%xmm4,%%xmm2\n\t"
> 	"addps %%xmm1,%%xmm6\n\t"
> 	"shufps \$68,%%xmm6,%%xmm4\n\t"
> 	"shufps \$238,%%xmm6,%%xmm2\n\t"
> 	"addps %%xmm4,%%xmm2\n\t"
> 	"movups %%xmm2,(%%ecx)\n\t"
>
> Sorry this is so rushed and unclear.  Of course, this needs to be
> changed somewhat for the other cases.
>
> Take care,
>
> > xmm7 as scratch, and it seems like a clumsy way to do it. How can it be
> > done in 4 instructions?
> >
> >         __asm__ __volatile__ ("movhlps " #reg ", %%xmm7\n"\
> >     			      "addps " #reg ", %%xmm7\n"\
> >     			      "movaps %%xmm7, " #reg "\n"\
> >                               "shufps \$1, " #reg ", %%xmm7\n"\
> >     			      "addss %%xmm7, " #reg "\n"\
> >
> >
> > Hope you can help me,
> >
> > Cheers,
> > Peter
> >
> >
> >
>
> --
> Camm Maguire			     			camm@enhanced.com
> ==========================================================================
> "The earth is but one country, and mankind its citizens."  --  Baha'u'llah
>

```
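
The two reductions discussed in the thread can be sketched with SSE intrinsics rather than inline assembly. This is a minimal sketch, not code from either author: the function names `hsum_ps` and `reduce4_store` are hypothetical, and the second function assumes, as Camm's windup step does, that the four single-precision results are contiguous in memory. `hsum_ps` mirrors the movhlps/addps/shufps/addss sequence from the question, and `reduce4_store` follows the unpack, add, shuffle, add pattern of the windup macro, so that four accumulators are reduced together and written with one unaligned store.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; needs an x86 target with SSE enabled */

/* Horizontal sum of a single register (hypothetical helper),
 * following the sequence quoted in the question. */
static float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);            /* (v2, v3, v2, v3)           */
    __m128 sum2 = _mm_add_ps(v, hi);              /* (v0+v2, v1+v3, ...)        */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, 1);  /* lane 0 now holds v1+v3     */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));  /* (v0+v2) + (v1+v3)          */
}

/* Reduce four accumulators at once (hypothetical helper).
 * On return out[0..3] = sum(a), sum(b), sum(c), sum(d).
 * Assumes the four outputs are contiguous; out need not be 16-byte aligned. */
static void reduce4_store(__m128 a, __m128 b, __m128 c, __m128 d, float *out)
{
    assert(out != NULL);

    /* Interleave pairs so each add works on two results at a time. */
    __m128 ab = _mm_add_ps(_mm_unpacklo_ps(a, b),   /* (a0, b0, a1, b1) */
                           _mm_unpackhi_ps(a, b));  /* (a2, b2, a3, b3) */
    __m128 cd = _mm_add_ps(_mm_unpacklo_ps(c, d),
                           _mm_unpackhi_ps(c, d));

    /* ab = (a0+a2, b0+b2, a1+a3, b1+b3), and likewise for cd. */
    __m128 even = _mm_shuffle_ps(ab, cd, 68);   /* (ab0, ab1, cd0, cd1) */
    __m128 odd  = _mm_shuffle_ps(ab, cd, 238);  /* (ab2, ab3, cd2, cd3) */

    /* One final add yields all four sums; one store writes them out. */
    _mm_storeu_ps(out, _mm_add_ps(even, odd));
}
```

Calling `reduce4_store(acc0, acc1, acc2, acc3, &C[i])` replaces four separate calls to `hsum_ps` followed by four scalar stores. When the results are not contiguous, as in the case Peter describes, the single store no longer applies and the lanes would have to be written out individually.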