-
I'm not sure I understand: the model doesn't see multiple users at the same time, because you're only running one query at a time. As for concurrent requests, I personally use the built-in llama.cpp server binary, which allows multiple queries at the same time. But if you want two users to have 15k of context each, you need to allocate the model with a ctx size of 30k.
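As a rough illustration of that approach (not the asker's code), here is a client-side sketch in Python. It assumes the llama.cpp server has been started with parallel slots enabled; the exact binary name and flags depend on your llama.cpp build, and the port, endpoint, and prompts below are placeholders:

```python
# Hypothetical client-side sketch. Assumes a llama.cpp server started roughly like:
#   ./llama-server -m model.gguf -c 30720 -np 2 -ngl 20
# where -c is the total context that gets split across the -np parallel slots
# (so each of the two slots ends up with ~15k tokens of context).
import threading
import requests

SERVER = "http://localhost:8080"  # default llama.cpp server address (assumption)

def ask(user_id: str, prompt: str) -> None:
    # The server exposes an OpenAI-compatible chat endpoint; each request is
    # scheduled onto a free slot, so two users can be served concurrently.
    resp = requests.post(
        f"{SERVER}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(user_id, resp.json()["choices"][0]["message"]["content"])

# Two users querying at the same time; no mutex is needed on the client side.
threads = [
    threading.Thread(target=ask, args=("user-a", "Summarize the plot of Hamlet.")),
    threading.Thread(target=ask, args=("user-b", "Explain what a KV cache is.")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `-np 2` and `-c 30720`, the server divides the total context between the two slots, which matches the "15k per user" sizing above.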
-
Hi, I have written a Flask backend with llama-cpp-python. Multiple users can talk to a model over a web interface. With a single user session everything works fine, but when a second user talks to the model I run into performance issues. The model is protected by a mutex, so requests are answered sequentially. The context for each user session is cached and fed into the model together with each new request. The model is running on the GPU and I am offloading 20 of 33 layers (more is not possible due to lack of GPU memory).
Does anyone have an idea why the model's performance decreases when it is used by multiple users?
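For reference, a stripped-down sketch of the kind of setup described above (a hypothetical reconstruction using the high-level llama-cpp-python API, not the actual code; the model path, context size, route, and prompt format are assumptions):

```python
# Hypothetical reconstruction of the setup described above.
import threading
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)

# Single model instance, 20 of 33 layers offloaded to the GPU as described.
llm = Llama(model_path="model.gguf", n_ctx=4096, n_gpu_layers=20)

model_lock = threading.Lock()   # requests are answered sequentially
session_context = {}            # per-session cached conversation text

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    session_id = data["session_id"]
    user_msg = data["message"]

    # The cached context for this session is fed back in with every new request,
    # so when sessions alternate, the prompt prefix the model sees changes
    # completely from one call to the next.
    history = session_context.get(session_id, "")
    prompt = history + "User: " + user_msg + "\nAssistant:"

    with model_lock:
        out = llm(prompt, max_tokens=256, stop=["User:"])

    answer = out["choices"][0]["text"]
    session_context[session_id] = prompt + answer + "\n"
    return jsonify({"answer": answer})
```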