Replies: 5 comments 1 reply
-
This is a pretty esoteric situation. It would speed up our investigation if you could provide a script or some other means of reproducing this easily. Ideally it would be as simple as
-
Thanks for the quick feedback. I'll try to prepare a simple simulation script.
-
This is how TCP works: it retries for a period of time before it declares the other end of the connection unresponsive. Heartbeats (note that values below 5 seconds are explicitly recommended against) and publisher confirm timeouts will help. TCP parameter tuning on client hosts can help, too. This is covered to some extent in the Heartbeats guide.
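For reference, the per-socket side of this advice can be sketched with nothing but `java.net.Socket`. This is a minimal illustration, not the client library's own configuration path: `SocketOptionsSketch`, `configure`, and the timeout value are hypothetical names and numbers chosen for the example. Note that `SO_TIMEOUT` bounds blocking reads only, so on its own it cannot unblock a write that is waiting on TCP retransmission; retransmission limits are an operating system setting configured outside the JVM.

```java
import java.io.IOException;
import java.net.Socket;

public class SocketOptionsSketch {

    /** Applies keepalive and a read timeout to a socket. The values are illustrative. */
    static void configure(Socket socket, int readTimeoutMillis) throws IOException {
        // SO_KEEPALIVE: asks the OS to probe an idle connection; probe timing is an OS setting.
        socket.setKeepAlive(true);
        // SO_TIMEOUT: bounds blocking reads only; it does not bound writes.
        socket.setSoTimeout(readTimeoutMillis);
    }

    public static void main(String[] args) throws IOException {
        // An unconnected socket is enough to show how the options are set.
        try (Socket socket = new Socket()) {
            configure(socket, 10_000);
            System.out.println("keepAlive=" + socket.getKeepAlive()
                    + " soTimeout=" + socket.getSoTimeout());
        }
    }
}
```

If the client library exposes a hook for configuring its sockets, the body of `configure` is the kind of code that would go there; whether that helps in this scenario is exactly what the thread is questioning.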
-
FWIW, we seem to be running into this exact problem regularly under high load (not sure if that is the trigger though). This is the thread dump:
Our current workaround is to wrap the publishing in an
-
Thanks for providing some steps to reproduce. I'll investigate more shortly. In the meantime, you can try to:
-
The RabbitMQ client can freeze while writing to a socket when the network interface is removed. For example, we can run an app in Docker and disconnect the network with the `docker network disconnect ...` command. If the connection is handling `basicPublish` at that moment, the call is very likely to get stuck for a long time. No timeout configuration seems to help (`SO_TIMEOUT`, heartbeats, `SO_KEEPALIVE`, ...).

The thread is stuck with this stacktrace:

We can see in the `netstat` output that the send buffer is occupied.

From reading this library's source code and the `NioSocketImpl` sources, the socket appears to still be in a "recoverable" state. The `flush` call is blocked, and `implWrite` is still optimistic about being able to write more (but not yet).

Ideally, either the flush would throw an exception (but that doesn't happen), or the library would detect heartbeat timeouts and close the connection from outside.

If we try to implement this kind of behavior in the application itself, we fail. For example, if we time out the `basicPublish` call and then try to `close`/`abort` the connection, it always tries to write something to the socket, so it blocks as well.

For this reason, we believe this is a bug in the library itself, albeit a very subtle one that is hard to fix.
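To make the application-level attempt concrete, here is a minimal sketch of timing out a blocking call from the caller's side, using only `java.util.concurrent`. It is an assumption about what such a wrapper could look like, not code from the report: `PublishTimeoutSketch` and `runWithTimeout` are hypothetical names, and a sleeping `Runnable` stands in for a `basicPublish` call stuck in a socket write.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PublishTimeoutSketch {

    /**
     * Runs a blocking action on the executor and waits at most timeoutMillis.
     * Returns true if the action finished in time, false if the wait timed out.
     */
    static boolean runWithTimeout(ExecutorService executor, Runnable blockingAction,
                                  long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<?> future = executor.submit(blockingAction);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort: interrupts the worker, but a thread blocked in a
            // socket write may not react to the interrupt.
            future.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "publisher");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        // Stand-in for a publish call that is stuck in a socket write.
        Runnable stuckPublish = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // the stand-in reacts to interruption; a real stuck write may not
            }
        };
        boolean completed = runWithTimeout(executor, stuckPublish, 200);
        System.out.println("publish completed in time: " + completed);
        executor.shutdownNow();
    }
}
```

A wrapper like this only frees the calling thread. The worker and the connection stay stuck, and as described above, a follow-up `close` or `abort` blocks on the same socket, which is why this does not amount to a fix.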