Thread: Number crunching using Single precision SSE regs

  1. #1

    Number crunching using Single precision SSE regs

    Intel & AMD processors (in 32 bit mode) support single precision arithmetic on four numbers at a time using the SSE registers.

    Just a few caveats though:

    This is ideal for vector processing, but beware the loss of accuracy with the fast reciprocal instructions (rcpps, and likewise rsqrtps), where accuracy has been traded for speed: I found that the Fibonacci ratio (the golden ratio) calculated in the example below is only accurate to 4 digits after applying the reciprocal.

    I also found that my AMD64 did not support operations directly from memory. All operands have to be loaded into the SSE registers first.

    Also, in older processors (pre 2006), FPU instructions cannot be mingled with SSE instructions since they share part of the same hardware.


    [code=thinbasic]

    ' Floating point vector maths using SIMD instructions

    uses "Oxygen"

    type Txyzw
    x as single
    y as single
    z as single
    w as single
    end type

    dim v1, v2 as Txyzw


    v1.x = 1
    v1.y = 2
    v1.z = 4
    v1.w = 8

    ' NB
    ' AMD does not support adding from memory, e.g. addps xmm0,[#v1]
    ' loss of precision with the fast reciprocal (rcpps)

    dim SSE_Demo as string = "
    'how accurate is 1/3 ?
    movups xmm0, [#v1] ' load
    movups xmm1, xmm0 ' copy
    addps xmm0, xmm1 ' Add
    addps xmm0, xmm1 ' Add
    divps xmm1, xmm0 ' divide
    mulps xmm1, xmm1 ' multiply
    sqrtps xmm1, xmm1 ' square root
    'rcpps xmm1, xmm1 ' reciprocal
    movups [#v2], xmm1 ' save
    'ret
    '
    ' First number is the Fibonacci ratio (the golden ratio)
    ' it should be 1.618033.. but does not quite make it
    movups xmm0, [#v1] ' load
    movups xmm1, xmm0 ' copy
    addps xmm0, xmm0 ' double
    mulps xmm0, xmm0 ' square
    addps xmm0, xmm1 ' Add
    sqrtps xmm0, xmm0 ' square root
    subps xmm0, xmm1 ' subtract
    addps xmm1, xmm1 ' double
    divps xmm0, xmm1 ' divide
    rcpps xmm0, xmm0 ' reciprocal
    movups [#v2], xmm0 ' save
    ret

    "

    o2_asmo SSE_Demo
    if len(o2_error) then
    msgbox 0, "Assembly error"+$CRLF+O2_Error+$CRLF+O2_View(SSE_Demo)
    else
    o2_exec
    msgbox 0, STR$(v2.x)+STR$(v2.y)+STR$(v2.z)+STR$(v2.w)
    'msgbox 0,O2_View(SSE_Demo)
    endif

    [/code]

  2. #2
    kryton9 (thinBasic MVP)

    Re: Number crunching using Single precision SSE regs

    Charles, I am not sure whether you knew this, but nVidia seems to have written a language for researchers to create programs that use the power of the GPU. Since you are a low-level power coder, you might want to check it out. I think you have access to over 100 cores in the GPU at the moment.

    It is called CUDA. Here is the link:
    http://www.nvidia.com/object/cuda_what_is.html

  3. #3

    Thanks Kent - I will investigate. The CUDA driver, which is downloading now, is quite a lump: 72 Megs!

    If we can devise light-weight support for GPU calculations, it will be well worth the effort.

    Using the x86 SIMD instructions for matrix calculations is not ideal, giving only a 2x improvement over the FPU. There are some new instructions in SSE4 which improve matters, but these made their appearance in 2007, so most CPUs out there won't support them.

  4. #4
    Petr Schreiber (Super Moderator)

    Hi Charles,

    thanks a lot, but it did not work on my AMD64. How can I tweak it to make it run?


    Petr

  5. #5

    Hi Petr,

    You could try commenting out instruction lines to see which ones are disruptive. We know that movups and addps work from your previous demo.

    My cpu is an Athlon 64 X2.

  6. #6
    Petr Schreiber (Super Moderator)

    All is ok,

    I had an old Oxygen DLL in that directory, don't know why.

    So it works well now, thanks!


    Petr

  7. #7

    Phew!

    I've adapted an Intel example of a 4x4 matrix multiply with SSE instructions. I don't understand the way it shuffles data around, but I'll post it here ASAP. Perhaps you will be able to tell me how it works.

  8. #8

    Okay, here are two versions of 4x4 matrix multiplication for comparison: SSE (Intel) and FPU (my code).

    [code=thinbasic]

    ' Floating point vector maths using SIMD instructions
    ' http://download.intel.com/design/Pen...l/24504501.pdf
    ' 4*4 Matrix multiply

    ' Also includes FPU-based alternative

    uses "Oxygen"

    type Txyzw
    x as single
    y as single
    z as single
    w as single
    end type

    dim va(16),vb(16),vc(16) as single

    'va(1) = 1
    'va(2) = 2
    'va(3) = 4
    'va(4) = 8

    va(1) = 1
    va(5) = 2
    va(9) = 4
    va(13) =8

    dim i as long
    for i=1 to 16 : vb(i)=1 : next


    dim SSE_Demo as string = "

    'call fpu_matrix_mul
    'ret

    ; see http://download.intel.com/design/Pen...l/24504501.pdf
    ;--------------
    sse_matrix_mul:
    ;--------------
    mov edx, #vb ; src1
    mov ecx, #va ; src2
    mov eax, #vc ; dst
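    ; each output row = sum over k of (element k of the src1 row, broadcast) * (row k of src2)
    ; shufps xmmN, xmmN, 0 copies lane 0 into all four lanes (the broadcast)
    ; xmm1, xmm3, xmm4, xmm5 hold the four rows of src2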
    movss xmm0, [edx]
    movups xmm1, [ecx]
    shufps xmm0, xmm0, 0
    movss xmm2, [edx+4]
    mulps xmm0, xmm1
    shufps xmm2, xmm2, 0
    movups xmm3, [ecx+16]
    movss xmm7, [edx+8]
    mulps xmm2, xmm3
    shufps xmm7, xmm7, 0
    addps xmm0, xmm2
    movups xmm4, [ecx+32]
    movss xmm2, [edx+12]
    mulps xmm7, xmm4
    shufps xmm2, xmm2, 0
    addps xmm0, xmm7
    movups xmm5, [ecx+48]
    movss xmm6, [edx+16]
    mulps xmm2, xmm5
    movss xmm7, [edx+20]
    shufps xmm6, xmm6, 0
    addps xmm0, xmm2
    shufps xmm7, xmm7, 0
    movlps [eax], xmm0
    movhps [eax+8], xmm0
    mulps xmm7, xmm3
    movss xmm0, [edx+24]
    mulps xmm6, xmm1
    shufps xmm0, xmm0, 0
    addps xmm6, xmm7
    mulps xmm0, xmm4
    movss xmm2, [edx+36]
    addps xmm6, xmm0
    movss xmm0, [edx+28]
    movss xmm7, [edx+32]
    shufps xmm0, xmm0, 0
    shufps xmm7, xmm7, 0
    mulps xmm0, xmm5
    mulps xmm7, xmm1
    addps xmm6, xmm0
    shufps xmm2, xmm2, 0
    movlps [eax+16], xmm6
    movhps [eax+24], xmm6
    mulps xmm2, xmm3
    movss xmm6, [edx+40]
    addps xmm7, xmm2
    shufps xmm6, xmm6, 0
    movss xmm2, [edx+44]
    mulps xmm6, xmm4
    shufps xmm2, xmm2, 0
    addps xmm7, xmm6
    mulps xmm2, xmm5
    movss xmm0, [edx+52]
    addps xmm7, xmm2
    shufps xmm0, xmm0, 0
    movlps [eax+32], xmm7
    movss xmm2, [edx+48]
    movhps [eax+40], xmm7
    mulps xmm0, xmm3
    shufps xmm2, xmm2, 0
    movss xmm6, [edx+56]
    mulps xmm2, xmm1
    shufps xmm6, xmm6, 0
    addps xmm2, xmm0
    mulps xmm6, xmm4
    movss xmm7, [edx+60]
    shufps xmm7, xmm7, 0
    addps xmm2, xmm6
    mulps xmm7, xmm5
    addps xmm2, xmm7
    movups [eax+48], xmm2
    ret
    ;--------------





    ;--------------
    fpu_matrix_mul:
    ;--------------
    mov ecx,#va
    mov edx,#vb
    mov eax,#vc

    block:
    ;-----
    call column
    call column
    call column
    call column
    ret

    column:
    ;------
    call cell
    call cell
    call cell
    call cell
    add edx,16
    sub ecx,16
    ret

    cell: ' row A * column B
    ;-----------------------
    fld dword ptr [ecx ]
    fmul dword ptr [edx ]
    fld dword ptr [ecx+16]
    fmul dword ptr [edx+04]
    fld dword ptr [ecx+32]
    fmul dword ptr [edx+08]
    fld dword ptr [ecx+48]
    fmul dword ptr [edx+12]
    faddp st(1),st(0)
    faddp st(1),st(0)
    faddp st(1),st(0)
    fstp dword ptr [eax]
    add eax,4
    add ecx,4
    ret


    "

    o2_asmo SSE_Demo
    if len(o2_error) then
    msgbox 0, "Assembly error"+$CRLF+O2_Error+$CRLF+O2_View(SSE_Demo)
    else
    o2_exec
    msgbox 0, STR$(vc(01))+STR$(vc(02))+STR$(vc(03))+STR$(vc(04))+$cr_
    +STR$(vc(05))+STR$(vc(06))+STR$(vc(07))+STR$(vc(08))+$cr_
    +STR$(vc(09))+STR$(vc(10))+STR$(vc(11))+STR$(vc(12))+$cr_
    +STR$(vc(13))+STR$(vc(14))+STR$(vc(15))+STR$(vc(16)) '+$cr_
    'msgbox 0,O2_View(SSE_Demo)
    endif

    [/code]

  9. #9
    Petr Schreiber (Super Moderator)

    Hi Charles,

    cool code indeed!
    SSE performs 1.25x faster than your classic code. Not a bad result for your solution, as I would have expected SSE to make it a lot faster.


    Petr

  10. #10

    There may be some overhead in your test loop, Petr. Intel suggests 2.1x faster than their FPU equivalent. I'm getting 0.1 secs for SSE against 0.17 secs for the FPU version under PB. But in any case it is not a major leap in performance. SSE2 is not flexible enough to do this matrix stuff efficiently.

    There is still some additional efficiency to squeeze out of the FPU version by expanding some of the subroutines and maybe using more registers, but this one was compact and simple to write.

    PS: Trouble with CUDA - Nvidia GeForce 7600 too old?
