Suggestion for better Nats connection handling in the Agent #2372
nouseforaname
started this conversation in
Ideas
Replies: 2 comments
-
@nouseforaname most of this will be implemented in cloudfoundry/bosh-agent#279. Is there still something open for discussion?
-
I had another look at cloudfoundry/bosh-agent#279. IMHO, what is still open after that one is merged is:
@nouseforaname Where do you see room for improvement regarding the hostname use case?
-
I would like to suggest refactoring the current implementation of the
director NATS client in the agent.
The code around NATS communication currently does not produce helpful logs.
We discovered this while debugging failed deployments right after a director upgrade.
In some situations there seems to be a race condition between the following steps:
1. The old director goes down.
2. The agent on the VM retries sending heartbeats.
3. The director comes up and gets a deploy command issued right away, so it sends a ping to the agent.
4. ??? It is currently unclear why the agent has not yet reconnected to NATS at this point.
5. The director doesn't get a response in time and the deploy command fails with timed_out.
The issue has been worked around in CI by introducing a 3 minute wait after the upgrade
of the director. It seems that if the agent DID NOT crash and restart during BOSH's
downtime, the connection is somehow left in a bad state. Cunnie suggested that the ARP
cache could be the reason the connection is not successfully re-established after the
director comes back up, since the new director's IP may have a different MAC address
and the ARP table is only cleaned on agent restarts.
There is a PR to handle the above situation with some extra logging, but while doing
the research for it, I found that there are quite a few possible optimizations in
that area.
There are additional use cases where we might want more logic to react to certain
agent connection issues.
Notably:
Instead of waiting for the DNS cache to expire on a connection failure, we could proactively
start checking for new IPs and reconfigure the connection on the fly if necessary. That
would also apply to updating the firewall rules for outgoing NATS traffic.
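To make the DNS idea more concrete, here is a minimal sketch in Go of detecting that a hostname now resolves to different IPs. It is not taken from the agent code: `hostWatcher`, `sameAddrs`, and `poll` are hypothetical names, and the step where the agent would reconfigure the NATS connection and the firewall rules is only hinted at by the `changed` return value.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"sort"
)

// hostWatcher is a hypothetical helper that remembers the last address set
// seen for a host. lookup has the same signature as net.LookupHost so a fake
// resolver can be injected.
type hostWatcher struct {
	host   string
	lookup func(host string) ([]string, error)
	last   []string
}

// sameAddrs reports whether a and b contain the same addresses, ignoring order.
func sameAddrs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// poll resolves the host once and reports whether the address set differs
// from the last successful poll. The first successful poll with a non-empty
// result counts as a change. On lookup failure the previous set is kept.
func (w *hostWatcher) poll() (changed bool, addrs []string, err error) {
	addrs, err = w.lookup(w.host)
	if err != nil {
		return false, nil, err
	}
	if sameAddrs(w.last, addrs) {
		return false, addrs, nil
	}
	w.last = addrs
	return true, addrs, nil
}

func main() {
	// "localhost" stands in for the director hostname (assumption).
	w := &hostWatcher{host: "localhost", lookup: net.LookupHost}
	for i := 0; i < 2; i++ {
		changed, addrs, err := w.poll()
		if err != nil {
			log.Printf("lookup %s failed: %v", w.host, err)
			continue
		}
		// A real agent would reconnect and update firewall rules when changed is true.
		fmt.Printf("poll %d: changed=%v addrs=%v\n", i+1, changed, addrs)
	}
}
```

Calling `poll` on a connection failure, or on a timer, would let the client notice a new director IP without waiting for a cache to expire.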
Ultimately it may be a good idea to move the retry logic for the NATS client away from BOSH's native retry. My
impression is that, depending on the underlying client, requests have different flows that may not
be properly covered by the current use of BOSH retryable to handle request failures. Additionally,
while connection failures are retried at the client level, requests are retried externally, which makes it harder
to debug failures or to handle issues more gracefully within the client itself.
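As a rough illustration of keeping request retries inside the client, the following Go sketch logs every attempt in one place. It is only a sketch under assumptions: `retryRequest`, `logFunc`, and `errNotConnected` are hypothetical names, the doubling backoff is my choice, and none of it reflects the existing BOSH retry code.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errNotConnected is a placeholder error used for the demonstration.
var errNotConnected = errors.New("not connected")

// logFunc matches log.Printf so any printf-style logger can be plugged in.
type logFunc func(format string, args ...interface{})

// retryRequest runs op up to maxAttempts times (at least once), logs the
// outcome of every attempt, and doubles the delay after each failure.
func retryRequest(name string, maxAttempts int, delay time.Duration, logf logFunc, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			logf("%s: attempt %d/%d succeeded", name, attempt, maxAttempts)
			return nil
		}
		logf("%s: attempt %d/%d failed: %v", name, attempt, maxAttempts, err)
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("%s: giving up after %d attempts: %w", name, maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated request that fails twice before succeeding.
	err := retryRequest("heartbeat", 5, 10*time.Millisecond, log.Printf, func() error {
		calls++
		if calls < 3 {
			return errNotConnected
		}
		return nil
	})
	fmt.Println("error:", err, "calls:", calls)
}
```

Because the loop lives next to the connection handling, the same logger sees both reconnects and request retries, which is the debugging benefit described above.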
If you've made it this far, thanks for reading and please leave feedback / suggestions.
Thanks in advance.