-
Hello All,
However, this workaround seems to be quite slow. For that reason, I searched for a more native "+=" operation. I found that with OpenCL 3 there are operators:
I tried that on a recent Windows 10 workstation that contains an NVIDIA T400 4GB and an Intel(R) UHD Graphics 770. Whichever GPU I choose, the #pragma is ignored and the atomic_fetch_add function is not found. So it seemed to be good. But I also found that another version has to be taken into account, dev.opencl_c_version. For both GPUs, it displays OpenCL 1.2. I would like to know where this limitation comes from: is it the chips, the drivers, or the libraries I use? And how can I change this, if that is doable? Thanks a lot,
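To make the distinction between the two version numbers concrete, here is a minimal sketch that compares a device (API) version string with a kernel-language version string. It assumes the usual layouts "OpenCL <major>.<minor> ..." and "OpenCL C <major>.<minor> ...". The helper names `parse_cl_version` and `kernel_language_lags_platform` are hypothetical, and the example strings are illustrative rather than copied from the machine described above. With PyOpenCL, the two strings would presumably come from `dev.version` and `dev.opencl_c_version`.

```python
import re


def parse_cl_version(version_string):
    """Return (major, minor) from strings such as 'OpenCL 3.0 CUDA' or 'OpenCL C 1.2 '."""
    match = re.match(r"OpenCL (?:C )?(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized OpenCL version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_language_lags_platform(device_version, opencl_c_version):
    """True when the kernel language version is older than the device API version."""
    return parse_cl_version(opencl_c_version) < parse_cl_version(device_version)


# Illustrative strings (assumed, not measured on real hardware).
print(parse_cl_version("OpenCL 3.0 CUDA"))                                # (3, 0)
print(kernel_language_lags_platform("OpenCL 3.0 CUDA", "OpenCL C 1.2 "))  # True
```

If the second call returns True, kernels are compiled against the older language version, which would be consistent with newer atomic functions not being found even though the device advertises a newer API version.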
-
I believe this issue comes from the NVIDIA driver, which claims to support OpenCL 3 but in practice only supports OpenCL 1.2. Good luck with them.
-
A few thoughts on this:
-
Thanks for your answer. I will have a look at "reduction kernel" techniques. Just quickly: I accumulate these sums in an array at various, not really predictable positions that come from a complex computation. I think that creating, for example, 10 arrays instead of one may lead to less contention between the atomic additions. Then, at the end, I could add the 10 arrays together to make a single one. Does that sound reasonable? I know nothing about inline PTX; I will have a look at that too.
-
Like so many things in HPC, that sounds like a trade-off, and it's hard to guess whether it's profitable or not---that depends on the details. You'd certainly be expending more memory bandwidth by having 10 arrays...
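
The idea of spreading the additions over several partial arrays can be sketched on the host as follows. This is only a sequential model in plain Python, not an OpenCL kernel: the function names and the rule "work item number modulo number of copies" for picking a partial array are assumptions made for illustration. It shows that both variants give the same totals and that the second one needs `num_copies` times the memory plus a final summation pass.

```python
def scatter_add_single(indices, values, size):
    """Accumulate every value into one shared array.

    On a GPU, each `+=` below would have to be an atomic addition.
    """
    out = [0] * size
    for index, value in zip(indices, values):
        out[index] += value
    return out


def scatter_add_partials(indices, values, size, num_copies=10):
    """Accumulate into several partial arrays, then sum them into one."""
    partials = [[0] * size for _ in range(num_copies)]
    for work_item, (index, value) in enumerate(zip(indices, values)):
        # Hypothetical rule: each work item writes to the copy selected by its id.
        partials[work_item % num_copies][index] += value
    # Final pass: add the partial arrays element by element.
    return [sum(column) for column in zip(*partials)]


indices = [0, 3, 3, 1, 3, 0, 2, 3]
values = [1, 2, 3, 4, 5, 6, 7, 8]
print(scatter_add_single(indices, values, 4))     # [7, 4, 7, 18]
print(scatter_add_partials(indices, values, 4))   # [7, 4, 7, 18]
```

Integer values are used so that the two results match exactly; with floating-point values the different summation order could change the last digits. Whether the extra memory traffic pays off on real hardware can only be settled by measuring.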