-
I'm not sure I understand: the model doesn't see multiple users at the same time, because you're only running one query at a time. As for concurrent requests, I personally use the built-in llama.cpp server binary, which allows multiple queries at the same time. But if you want two users to have 15k of context each, you need to allocate the model with a ctx size of 30k.
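As a rough illustration of that approach (not the asker's code), here is a client-side sketch in Python. It assumes the llama.cpp server has been started with parallel slots enabled; the exact binary name and flags depend on your llama.cpp build, and the port, endpoint, and prompts below are placeholders:

```python
# Hypothetical client-side sketch. Assumes a llama.cpp server started roughly like:
#   ./llama-server -m model.gguf -c 30720 -np 2 -ngl 20
# where -c is the total context that gets split across the -np parallel slots
# (so each of the two slots ends up with ~15k tokens of context).
import threading
import requests

SERVER = "http://localhost:8080"  # default llama.cpp server address (assumption)

def ask(user_id: str, prompt: str) -> None:
    # The server exposes an OpenAI-compatible chat endpoint; each request is
    # scheduled onto a free slot, so two users can be served concurrently.
    resp = requests.post(
        f"{SERVER}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(user_id, resp.json()["choices"][0]["message"]["content"])

# Two users querying at the same time; no mutex is needed on the client side.
threads = [
    threading.Thread(target=ask, args=("user-a", "Summarize the plot of Hamlet.")),
    threading.Thread(target=ask, args=("user-b", "Explain what a KV cache is.")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `-np 2` and `-c 30720`, the server divides the total context between the two slots, which matches the "15k per user" sizing above.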
-
Hi, I have written a Flask backend with llama-cpp-python. Multiple users can talk to a model over a web interface. With a single user session everything works fine, but when a second user talks to the model I run into performance issues. The model is protected by a mutex, so requests are answered sequentially. The context for each user session is cached and fed into the model together with each new request. The model is running on the GPU and I am offloading 20 of 33 layers (more is not possible due to lack of GPU memory).
Does anyone have an idea why the model's performance decreases when it is used by multiple users?
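For reference, a stripped-down sketch of the kind of setup described above (a hypothetical reconstruction using the high-level llama-cpp-python API, not the actual code; the model path, context size, route, and prompt format are assumptions):

```python
# Hypothetical reconstruction of the setup described above.
import threading
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)

# Single model instance, 20 of 33 layers offloaded to the GPU as described.
llm = Llama(model_path="model.gguf", n_ctx=4096, n_gpu_layers=20)

model_lock = threading.Lock()   # requests are answered sequentially
session_context = {}            # per-session cached conversation text

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    session_id = data["session_id"]
    user_msg = data["message"]

    # The cached context for this session is fed back in with every new request,
    # so when sessions alternate, the prompt prefix the model sees changes
    # completely from one call to the next.
    history = session_context.get(session_id, "")
    prompt = history + "User: " + user_msg + "\nAssistant:"

    with model_lock:
        out = llm(prompt, max_tokens=256, stop=["User:"])

    answer = out["choices"][0]["text"]
    session_context[session_id] = prompt + answer + "\n"
    return jsonify({"answer": answer})
```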