Information about new OpenCL headers



Petr Schreiber
04-09-2011, 18:16
Dear friends,

recent advances in ThinBASIC syntax have allowed me to remaster the OpenCL headers with, I hope, the most faithful translation of the original Khronos 1.1 headers possible.

To start enjoying the technology, you will need:

A GeForce 8 series or newer card, with the latest drivers
ThinBASIC 1.8.9.0 (http://www.thinbasic.com/community/showthread.php?11345-thinBasic-1.8.9.0-available-as-stable-release&p=84526#post84526)
Latest OpenCL 1.1 headers for ThinBASIC (http://www.thinbasic.com/community/showthread.php?10159-OpenCL-Headers-Updated-Sep-04-2011)


Then you can download the updated versions of the OpenCL examples:

OpenCL: Gentle introduction to OpenCL (http://www.thinbasic.com/community/showthread.php?11347-OpenCL-Gentle-introduction-to-OpenCL) - completely new example
OpenCL: Device information (http://www.thinbasic.com/community/showthread.php?10175-OpenCL-Device-information-Updated-Sep-04-2011) - remastered
OpenCL: Hello World (http://www.thinbasic.com/community/showthread.php?10323-OpenCL-quot-Hello-World-quot-adapted-from-Apple-code-Updated-Sep-04-2011) - remastered
OpenCL: Vector add (http://www.thinbasic.com/community/showthread.php?10155-OpenCL-Vector-Add-Updated-Sep-04-2011) - remastered
OpenCL: Image processing (http://www.thinbasic.com/community/showthread.php?10327-OpenCL-Image-Processing-Test-Updated-Sep-04-2011) - remastered, many new kernels added, ability to save the result to a file


The examples and headers have been extensively tested on three different GPUs - GeForce G210M, GeForce GT320 and Quadro FX1800M (thanks Eros!) - but as usual with GPGPU computing, be careful :)


Petr

zak
05-09-2011, 08:51
Hi Petr
My GeForce 7 was damaged recently by excessive heat during a long process on my desktop PC, and now I am using the poor-performance onboard graphics, so I will buy another card to be able to run your examples. My question: do all advanced graphics cards use PCI-E? My damaged card was inserted in a PCI-E slot; there are also two PCI slots.
Or perhaps they use another slot which I don't know about.

Petr Schreiber
05-09-2011, 09:46
Hi Zak,

I am sorry to hear your GeForce 7 is down, I liked that series. Regarding the upgrade - yes, PCI-E seems to be the current standard for new GPUs. There is an observable speed difference between PCI-E 1x and PCI-E 4x, but the higher speeds (16x) do not bring much more power compared to 4x.

For OpenCL experiments, even the cheapest GeForce G210 is fine, but for some serious performance I would recommend anything higher. The key parameter for parallel performance is the number of CUDA cores. Here the range is incredible. My G210M has 16 of them, but at home I have a non-reference GTX260 which has 224 CUDA cores, and there are 512-core monsters on the market at the moment. I would say anything with 48+ CUDA cores should offer some interesting performance for both graphics and compute programming.
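If you are curious how many units your own card has, the Khronos C API (which the ThinBASIC headers translate) can tell you through clGetDeviceInfo. A minimal C sketch - note that it reports compute units (on NVIDIA, one compute unit is one multiprocessor containing several CUDA cores), not single cores:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        units = 0;

    /* Take the first platform and its first GPU device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Number of compute units (multiprocessors on NVIDIA) */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);
    printf("Compute units: %u\n", units);
    return 0;
}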

OpenCL 1.0 is supported on the GeForce 8, 9, 2xx and 3xx series; OpenCL 1.1 should be supported on the GeForce 4xx and 5xx series and up.

At the moment I am investigating the situation in AMD Radeon land. The interesting thing is that of all my friends only one has a Radeon GPU, which makes it a bit difficult :)


Petr

kryton9
06-09-2011, 02:10
Petr, I read the article you linked to above and it was a really great intro for coming to grips with how OpenCL is designed. One thing I did not understand is how he broke the work items and work groups down... is this decided by the number of cores available?

Petr Schreiber
06-09-2011, 08:12
Hi Kent,

I am happy to see the article got your interest :) For many tasks, you can let OpenCL do this division for you automagically; the driver will take care of it.
Especially during the learning process, this is a good helper, as it lets you focus on other things.
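In the C API (which the ThinBASIC headers follow), this simply means passing NULL as the local work size when enqueuing the kernel. A sketch, assuming the queue and kernel have already been created:

#include <CL/cl.h>

/* Enqueue `kernel` over `n` work-items in one dimension,
   letting the driver choose the work-group division
   (local work size = NULL). */
cl_int run_auto(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &n, NULL, 0, NULL, NULL);
}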

When pushing the performance to the edge, extensive use of local memory is a good idea. This is where you might need to take more control of the precise division into work groups, so you do not run out of local memory, which is often as small as a few tens of KB per work group, compared to hundreds of MB of slower global memory.
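You can also ask the device how much local memory it offers before choosing a group size; a small C fragment, assuming `device` is a valid cl_device_id:

cl_ulong localMem = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(localMem), &localMem, NULL);
printf("Local memory: %lu KB\n", (unsigned long)(localMem / 1024));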

Here the author needed to make sure two groups are created (mostly to demonstrate the cooperation inside a workgroup), so he forced it by passing OpenCL his specific wish.
You can see in the kernel code that, until the barrier is reached (= the mini-sums are ready), output is written only to fast local memory. After that, the whole group's data is summed by the first work-item in each group and passed to global memory so it can be read back.
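For illustration, here is a minimal OpenCL C kernel following the same pattern (a sketch of the idea, not the article's exact code):

__kernel void group_sum(__global const float *input,
                        __global float *groupResults,
                        __local  float *scratch)
{
    int lid  = get_local_id(0);            /* index inside the group   */
    int gid  = get_global_id(0);           /* index in the whole range */
    int size = (int)get_local_size(0);     /* work-items per group     */

    /* Phase 1: each work-item writes only to fast local memory */
    scratch[lid] = input[gid];

    /* Wait until the whole group has its values in place */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Phase 2: the first work-item of the group sums the local
       values and writes one result per group to global memory */
    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < size; i++)
            sum += scratch[i];
        groupResults[get_group_id(0)] = sum;
    }
}

The __local buffer is sized from the host side, by passing the byte count and a NULL pointer to clSetKernelArg for that argument.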

If you want, you can check the number of cores yourself (please see the example (http://www.thinbasic.com/community/showthread.php?10175-OpenCL-Device-information-Updated-Sep-04-2011)) and then arrange the "topology" of the calculation yourself. But if you don't, the driver again takes care of all the necessary operations; so even if the work-group pattern does not match the hardware, or the problem is bigger than the number of cores on the GPU, no explosion occurs and the computation will still run.


Petr