Longer lived cache for opencl_fft_app #134

Draft · wants to merge 1 commit into main
Conversation

@alexfikl (Contributor) commented Aug 18, 2022

#131 and #132 fixed the long codegen times, but there still seem to be a few runtime issues left. This is the first one I found from a quick profile.

Caching on the cl_context doesn't seem like a particularly good idea though. Ideally we could bubble this up and let pytential cache it on the array context. Thoughts?

For some context: this seems to be happening because of my moving meshes, i.e. I keep recreating the QBXLayerPotentialSource, which recreates the sumpy wranglers and clears all the caches.

@alexfikl (Contributor Author)

Here is a pretty picture from speedscope:

[Screenshot: speedscope flame graph, 2022-08-18]

@alexfikl (Contributor Author)

This seems to work in practice too: for one time step I was seeing roughly 25-26s,

[0009/0313] t = 4.51667e-02/1.57080e+00 dt = 5.01852e-03 (wall **2.622e+01s** cpu 3.1x)

which became roughly 8-9s:

[0009/0313] t = 3.76389e-01/1.57080e+00 dt = 5.01852e-03 (wall 7.697e+00s cpu 8.9x)

(everything else being the same)

@inducer (Owner) commented Aug 18, 2022

(BEGIN UNRELATED TRAIN OF THOUGHT)
Technically, I think this ought to apply to more than just the FFT (in particular the translation operators). However, those get cached inside loopy kernels, which in turn live in global caches. This leads to CL contexts being held indefinitely unless we merge inducer/loopy#671, which I just made.
(END UNRELATED TRAIN OF THOUGHT)

I agree that shoving this in the CL context is not ideal, because it leads to a reference cycle that will never be garbage-collected. At some point (hopefully not too far into the future), array contexts will make their way into sumpy, at which point we could just use them here. Would you be willing to carry this diff yourself until that point?

@alexfikl (Contributor Author)

> Would you be willing to carry this diff yourself until that point?

Yeah, I was thinking sumpy isn't that far off from having array context support either, so I'm fine with holding on to this until it gets there.

alexfikl marked this pull request as a draft on August 18, 2022, 22:09
@isuruf (Collaborator) commented Aug 19, 2022

I'm confused as to why this would cause a performance improvement. POCL has a global cache and that should make clBuildProgram return quickly.

@alexfikl (Contributor Author)

@isuruf My guess was that initializeVkFFT in
https://raw.githubusercontent.com/DTolm/VkFFT/master/vkFFT/vkFFT.h
was the one causing issues because it's pretty gigantic. I didn't have time to look into it in detail, but does it construct some plan for the FFT or other precomputations?

@isuruf (Collaborator) commented Aug 19, 2022

Maybe that is a bottleneck. I was looking at the profile you attached in a previous comment at #134 (comment) which shows that clBuildProgram takes up most of the time.

@alexfikl (Contributor Author)

> Maybe that is a bottleneck. I was looking at the profile you attached in a previous comment at #134 (comment) which shows that clBuildProgram takes up most of the time.

Hm, I used py-spy to get the profile, so maybe it's a sampling artifact? That would be assuming clBuildProgram is very fast when it hits the cache.

Would running it through cProfile be more helpful?
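For reference, the two kinds of profile could be collected roughly like this. This is only a sketch: `driver.py` is a stand-in created inline, not the actual time-stepping script, which is not shown in this thread.

```shell
# Create a stand-in driver so the example is self-contained; in practice
# this would be the actual time-stepping script.
cat > driver.py <<'EOF'
print("stand-in for the actual run")
EOF

# Deterministic profile: cProfile records every Python-level call.
python3 -m cProfile -o profile.pstats driver.py

# Sampling profile: py-spy periodically samples the stack, which can
# misattribute time spent in native frames such as clBuildProgram.
#   py-spy record --format speedscope -o profile.json -- python driver.py
```

Because py-spy only samples, long native calls can absorb samples from neighboring work, which is one plausible explanation for the discrepancy discussed here.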

@isuruf (Collaborator) commented Aug 19, 2022

> Would running it through cProfile be more helpful?

Yes please. Is this the same code as #129 (comment)?

@alexfikl (Contributor Author)

> Yes please. Is this the same code as #129 (comment)?

Nope, this is another code that does some time stepping with the Stokes velocity field. Running that should also show how long it takes to initialize pyvkfft for a similar setup, though.

I'll run it through cProfile and report back!

@alexfikl (Contributor Author)

profile..zip

@isuruf This should contain the cProfile results (pstats file) and the py-spy results (json file) with the current sumpy@main. The run does 5 time steps to catch the repeated calls to get_opencl_fft_app.

Both of them seem to show that quite a bit of time is spent in get_opencl_fft_app, but I'm not sure why py-spy is blaming clBuildProgram.
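A .pstats file like the one attached can be inspected along these lines. This is a self-contained sketch: the profile is generated inline here with a throwaway workload, whereas in practice `pstats.Stats` would be pointed at the attached file.

```python
import cProfile
import pstats

# Generate a small profile inline so this example runs on its own; to
# inspect the attached results, pass that .pstats filename to
# pstats.Stats instead.
cProfile.run("sum(i * i for i in range(100_000))", "example.pstats")

# Sorting by cumulative time is what surfaces hotspots like
# get_opencl_fft_app in the attached profiles.
stats = pstats.Stats("example.pstats")
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
```

Unlike py-spy's sampled flame graph, this view is deterministic, so it should settle whether clBuildProgram itself is actually hot.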

@isuruf (Collaborator) commented Aug 19, 2022

Thanks. I can see that it is indeed quite slow. I'll have a look to see if that can be fixed.
