About dev.opencl_c_version and atomic_fetch_add and OpenCL version #691

mikamyara · 2023-07-07T14:44:09Z

Hello All,
I am new to GPU computing / OpenCL. I wrote a kernel that requires a += operation on values shared between threads.
I found here that the initial OpenCL version did not have a thread-wide += operation for floats, and people usually used code like:

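  // Atomic float add on global memory, emulated with a compare-and-swap loop.
  inline void atomicAdd_g_f(volatile __global float *addr, float val)
  {
    // View the same 32 bits as either an integer or a float.
    union {
      unsigned int u32;
      float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
      expected.f32 = current.f32;
      next.f32 = expected.f32 + val;
      // Store next only if *addr still holds expected; returns the old value.
      current.u32 = atomic_cmpxchg((volatile __global unsigned int *)addr,
                                   expected.u32, next.u32);
    } while (current.u32 != expected.u32);  // retry if another thread got in first
  }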

It seems, however, that this workaround is quite slow. For that reason, I searched for a more native "+=" operation. I found that OpenCL 3 provides the following:

// at top of kernel file :
#pragma OPENCL EXTENSION cl_ext_float_atomics : enable

// somewhere in my code :
atomic_fetch_add(ptr,value)

I tried that on a recent Windows 10 workstation that contains an NVIDIA T400 4GB and an Intel(R) UHD Graphics 770. Whichever GPU I choose, the #pragma is ignored and the atomic_fetch_add function is not found.
I checked the OpenCL version for PyOpenCL and for the NVIDIA and Intel GPUs, and found:
PyOpenCL version: 2023.1.1 // OpenCL header version: 3.0
Nvidia : OpenCL 3.0 CUDA
Intel : OpenCL 3.0 NEO

So that looked fine. But I also found that another version has to be taken into account: dev.opencl_c_version. For both GPUs, it displays OpenCL 1.2.

I would like to know where this limitation comes from. Is it from the chips, the drivers, or the libraries I use? And how can I change it, if that is possible?
Alternatively, is there something like a GPU assembly version of the add function that I could use, for example through platform detection?
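
The version strings mentioned above can be compared programmatically. The following is only a minimal sketch: the helper names parse_cl_version and has_extension are hypothetical, and the literal strings are illustrative values modelled on this report. With PyOpenCL, the inputs would presumably come from dev.version, dev.opencl_c_version and dev.extensions.

```python
import re


def parse_cl_version(version_string):
    """Extract (major, minor) from strings such as 'OpenCL 3.0 CUDA' or 'OpenCL C 1.2 '."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"no version number in {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def has_extension(extensions, name):
    """Return True if `name` is a whole token in the space-separated extension list."""
    return name in extensions.split()


# Illustrative inputs (assumption: on a real system these are read from a
# PyOpenCL device object, e.g. dev.version, dev.opencl_c_version, dev.extensions).
api_version = parse_cl_version("OpenCL 3.0 CUDA")
c_version = parse_cl_version("OpenCL C 1.2 ")
float_atomics = has_extension(
    "cl_khr_fp64 cl_khr_global_int32_base_atomics", "cl_ext_float_atomics"
)

print("API version:", api_version)
print("OpenCL C version:", c_version)
print("cl_ext_float_atomics advertised:", float_atomics)
```

If the extension name is missing from the device's extension list, the #pragma has nothing to enable, whatever the platform version string says.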

Thanks a lot,
Mikhaël

kif · 2023-07-07T14:55:05Z

I believe this issue comes from the NVIDIA driver, which claims to support OpenCL 3 but in practice only supports OpenCL 1.2. Good luck with them.
I don't think the hack based on atomic_cmpxchg is that slow... do you have any evidence (besides commercial ads from NVIDIA)?

1 reply

mikamyara Jul 7, 2023
Author

Thanks for your answer.

I do not have real evidence, as I have no way to compare (other than rewriting everything in CUDA). I just tried using int instead of float and gained a factor of 3 in run time. I don't know whether that comes from float vs. int at the hardware level or from the slowness of the little hack.
About NVIDIA I am not surprised, but I don't understand why it's the same for the Intel GPU.

inducer (Maintainer) · 2023-07-07T15:00:16Z

A few thoughts on this:

Generally, the efficiency of atomic operations depends on the amount of contention. This is doubly true for the compare-and-swap version, since it has to redo a bunch of work in the case of an unsuccessful atomic. It sounds like you're testing a heavily contended atomic; it's not a surprise that that's slow. Rewiring your algorithm to use reductions (see ReductionKernel) may be your best bet.
If you decide to stick with atomics, using inline PTX should let you access any device capability, whether explicitly exposed by the CL runtime or not.
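
As a rough illustration of the inline PTX suggestion combined with platform detection, the sketch below keeps two OpenCL C variants of a float atomic add in Python strings and picks one by platform name. Everything here is an assumption rather than something confirmed in this thread: the function name atomic_add_source is hypothetical, the PTX variant assumes an NVIDIA driver that accepts inline asm in OpenCL C and a 64-bit address space, and it has not been tested on the hardware discussed above.

```python
# OpenCL C source kept in Python strings, as one might prepend to a kernel
# before building it (e.g. with pyopencl.Program).

# Portable fallback: the compare-and-swap loop quoted in the original question.
CAS_ATOMIC_ADD = '''
inline void atomic_add_f(volatile __global float *addr, float val)
{
    union { unsigned int u32; float f32; } next, expected, current;
    current.f32 = *addr;
    do {
        expected.f32 = current.f32;
        next.f32 = expected.f32 + val;
        current.u32 = atomic_cmpxchg((volatile __global unsigned int *)addr,
                                     expected.u32, next.u32);
    } while (current.u32 != expected.u32);
}
'''

# NVIDIA-only variant (assumption: the driver accepts inline PTX and uses
# 64-bit pointers, hence the "l" constraint for the address operand).
PTX_ATOMIC_ADD = '''
inline void atomic_add_f(volatile __global float *addr, float val)
{
    float old;
    asm volatile("atom.global.add.f32 %0, [%1], %2;"
                 : "=f"(old) : "l"(addr), "f"(val) : "memory");
}
'''


def atomic_add_source(platform_name):
    """Pick the float atomic-add helper based on the OpenCL platform name."""
    if "NVIDIA" in platform_name.upper():
        return PTX_ATOMIC_ADD
    return CAS_ATOMIC_ADD


# Illustrative platform names; real ones would come from the platform object.
for name in ("NVIDIA CUDA", "Intel(R) OpenCL Graphics"):
    variant = "inline PTX" if atomic_add_source(name) is PTX_ATOMIC_ADD else "CAS loop"
    print(name, "->", variant)
```

Both variants expose the same atomic_add_f signature, so the rest of the kernel would not need to change. Neither removes contention, which remains the main cost either way.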

0 replies

mikamyara (Author) · 2023-07-07T15:49:45Z

Thanks for your answer. I will have a look at "reduction kernel" techniques. Just quickly: I accumulate these sums into an array at various, not really predictable positions that come from a complex computation. I think that creating, for example, 10 arrays instead of one might reduce the contention on the atomic additions. Then, at the end, I could add the 10 arrays together to make a single one. Does that sound reasonable?

I know nothing about inline PTX; I will have a look at that too.
Thanks for your advice!

2 replies

inducer Jul 7, 2023
Maintainer

Like so many things in HPC, that sounds like a trade-off, and it's hard to guess whether it's profitable or not---that depends on the details. You'd certainly be expending more memory bandwidth by having 10 arrays...
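
To make the idea under discussion concrete, here is a plain-Python sketch of the "several private copies, merged at the end" scheme. It only models the bookkeeping, not GPU timing. The function name scatter_add_privatized and the choice of copy by item index are illustrative assumptions.

```python
def scatter_add_privatized(indices, values, size, n_copies):
    """Accumulate values[i] into bin indices[i], spreading updates over n_copies arrays.

    On a GPU, each `+=` below would be an atomic add, and `i` would be the
    work-item id; using several copies lowers the chance that two work-items
    hit the same address at the same time.
    """
    copies = [[0.0] * size for _ in range(n_copies)]
    for i, (idx, val) in enumerate(zip(indices, values)):
        copies[i % n_copies][idx] += val
    # Final merge: element-wise sum of all private copies.
    return [sum(column) for column in zip(*copies)]


# Values chosen to be exactly representable, so summation order does not matter.
indices = [0, 2, 2, 1, 2, 0]
values = [1.0, 0.5, 0.25, 2.0, 4.0, 8.0]

single = scatter_add_privatized(indices, values, size=3, n_copies=1)
split = scatter_add_privatized(indices, values, size=3, n_copies=4)
print(single, split)
```

The cost is visible in the sketch: storage and the merge pass both grow with n_copies times size, which is the extra memory traffic mentioned above. With general floating-point data, the merged result may also differ from the single-array result in the last bits because the summation order changes.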

Answer selected by mikamyara

mikamyara Jul 7, 2023
Author

ok thanks, I will check that
