Hullo, I am learning asm not from to many days
till now, and I am trying to write my own
asm foutine to normalize a vector of three floats
Here is what I have invented myself
_asm_normalize10:; Function begin
push ebp ; 002E _ 55
mov ebp, esp ; 002F _ 89. E5
mov eax, dword [ebp+8H] ; 0031 _ 8B. 45, 08
fld dword [eax] ; 0034 _ D9. 00
fmul st0, st(0) ; 0036 _ DC. C8
fld dword [eax+4H] ; 0038 _ D9. 40, 04
fmul st0, st(0) ; 003B _ DC. C8
fld dword [eax+8H] ; 003D _ D9. 40, 08
fmul st0, st(0) ; 0040 _ DC. C8
faddp st1, st(0) ; 0042 _ DE. C1
faddp st1, st(0) ; 0044 _ DE. C1
fsqrt ; 0046 _ D9. FA
fld1 ; 0048 _ D9. E8
fdivrp st1, st(0) ; 004A _ DE. F1
fld dword [eax] ; 004C _ D9. 00
fmul st(0), st1 ; 004E _ D8. C9
fstp dword [eax] ; 0050 _ D9. 18
fld dword [eax+4H] ; 0052 _ D9. 40, 04
fmul st(0), st1 ; 0055 _ D8. C9
fstp dword [eax+4H] ; 0057 _ D9. 58, 04
fld dword [eax+8H] ; 005A _ D9. 40, 08
fmulp st1, st(0) ; 005D _ DE. C9
fstp dword [eax+8H] ; 005F _ D9. 58, 08
pop ebp ; 0062 _ 5D
ret ; 0063 _ C3
; _asm_normalize10 End of function
(about 90 cycles on my old (too old possibly ;-)
pentium 4 home machine)
How could it be improved
(I am interested mainly in pure FPU version,
but also SSE2 version (for my p4) would be later
welcome, and maybe also AVX tiptop version for
newest CPU also then [can be routine for calculating
4 or 8 normalizations at once]
(but nown mainly I am interested in solid old
FPU code)
TNX for help with that