Longer lived cache for opencl_fft_app #134

Draft · wants to merge 1 commit into main
Conversation

@alexfikl (Contributor) commented Aug 18, 2022

#131 and #132 fixed the long codegen times, but there still seem to be a few runtime issues left. This is the first one I found from a quick profile.

Caching on the cl_context doesn't seem like a particularly good idea though. Ideally we could bubble this up and let pytential cache it on the array context. Thoughts?

For some context: this seems to be happening because of my moving meshes, i.e. I keep recreating the QBXLayerPotentialSource, which recreates the sumpy wranglers and clears all the caches.

@alexfikl (Contributor Author)

Here is a pretty picture from speedscope:

[Screenshot: speedscope flame graph, 2022-08-18]

@alexfikl (Contributor Author)

This seems to work in practice too: for one time step I was seeing roughly 25-26s,

[0009/0313] t = 4.51667e-02/1.57080e+00 dt = 5.01852e-03 (wall **2.622e+01s** cpu 3.1x)

which became roughly 8-9s:

[0009/0313] t = 3.76389e-01/1.57080e+00 dt = 5.01852e-03 (wall 7.697e+00s cpu 8.9x)

(everything else being the same)

@inducer (Owner) commented Aug 18, 2022

(BEGIN UNRELATED TRAIN OF THOUGHT)
Technically, I think this ought to apply to more than just the FFT (in particular the translation operators). However, those get cached inside loopy kernels, which in turn live in global caches. This leads to CL contexts being held indefinitely unless we merge inducer/loopy#671, which I just made.
(END UNRELATED TRAIN OF THOUGHT)

I agree that shoving this in the CL context is not ideal, because it leads to a reference cycle that will never be garbage-collected. At some point (hopefully not too far into the future), array contexts will make their way into sumpy, at which point we could just use them here. Would you be willing to carry this diff yourself until that point?

@alexfikl (Contributor Author)

> Would you be willing to carry this diff yourself until that point?

Yeah, I was thinking sumpy isn't that far off from having array context support either, so I'm fine with holding on to this until it gets there.

alexfikl marked this pull request as a draft on August 18, 2022, 22:09
@isuruf (Collaborator) commented Aug 19, 2022

I'm confused as to why this would cause a performance improvement. POCL has a global cache and that should make clBuildProgram return quickly.

@alexfikl (Contributor Author)

@isuruf My guess was that initializeVkFFT in
https://raw.githubusercontent.com/DTolm/VkFFT/master/vkFFT/vkFFT.h
was the one causing issues because it's pretty gigantic. I didn't have time to look into it in detail, but does it construct some plan for the FFT or other precomputations?

@isuruf (Collaborator) commented Aug 19, 2022

Maybe that is a bottleneck. I was looking at the profile you attached in a previous comment at #134 (comment) which shows that clBuildProgram takes up most of the time.

@alexfikl (Contributor Author)

> Maybe that is a bottleneck. I was looking at the profile you attached in a previous comment at #134 (comment) which shows that clBuildProgram takes up most of the time.

Hm, I used py-spy to get the profile, so maybe it's a sampling artifact? That would be assuming clBuildProgram is very fast when it hits the cache.

Would running it through cProfile be more helpful?
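For reference, the two kinds of profile could be collected roughly like this. This is only a sketch: `driver.py` is a stand-in created inline, not the actual time-stepping script, which is not shown in this thread.

```shell
# Create a stand-in driver so the example is self-contained; in practice
# this would be the actual time-stepping script.
cat > driver.py <<'EOF'
print("stand-in for the actual run")
EOF

# Deterministic profile: cProfile records every Python-level call.
python3 -m cProfile -o profile.pstats driver.py

# Sampling profile: py-spy periodically samples the stack, which can
# misattribute time spent in native frames such as clBuildProgram.
#   py-spy record --format speedscope -o profile.json -- python driver.py
```

Because py-spy only samples, long native calls can absorb samples from neighboring work, which is one plausible explanation for the discrepancy discussed here.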

@isuruf (Collaborator) commented Aug 19, 2022

> Would running it through cProfile be more helpful?

Yes please. Is this the same code as #129 (comment)?

@alexfikl (Contributor Author)

> Yes please. Is this the same code as #129 (comment)?

Nope, this is another code that does some time stepping with the Stokes velocity field. Running that should also show how long it takes to initialize pyvkfft for a similar setup, though.

I'll run it through cProfile and report back!

@alexfikl (Contributor Author)

profile..zip

@isuruf This should contain the cProfile results (pstats file) and the py-spy results (json file) with the current sumpy@main. The run does 5 time steps to catch the repeated calls to get_opencl_fft_app.

Both of them seem to show that quite a bit of time is spent in get_opencl_fft_app, but I'm not sure why py-spy is blaming clBuildProgram.
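A .pstats file like the one attached can be inspected along these lines. This is a self-contained sketch: the profile is generated inline here with a throwaway workload, whereas in practice `pstats.Stats` would be pointed at the attached file.

```python
import cProfile
import pstats

# Generate a small profile inline so this example runs on its own; to
# inspect the attached results, pass that .pstats filename to
# pstats.Stats instead.
cProfile.run("sum(i * i for i in range(100_000))", "example.pstats")

# Sorting by cumulative time is what surfaces hotspots like
# get_opencl_fft_app in the attached profiles.
stats = pstats.Stats("example.pstats")
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
```

Unlike py-spy's sampled flame graph, this view is deterministic, so it should settle whether clBuildProgram itself is actually hot.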

@isuruf (Collaborator) commented Aug 19, 2022

Thanks. I can see that it is indeed quite slow. I'll have a look to see if that can be fixed.
