Suggestion for better Nats connection handling in the Agent #2372

nouseforaname · 2022-04-12T14:23:24Z

nouseforaname
Apr 12, 2022
Collaborator

I would like to suggest refactoring the current implementation of the
director Nats client in the agent.

The code around Nats communication currently does not produce helpful logs.
We discovered this while debugging deployments that failed right after a director upgrade.
In some situations there seems to be a race condition between the following steps:

1. Old director goes down
2. Agent on the VM retries sending heartbeats
3. Director comes up and gets a deploy command issued right away -> sends ping to
   agent
4. ???? It is currently unclear why the agent has not yet reconnected to Nats at this point
5. Director doesn't get a response in time and the deploy command fails with timed_out.

The issue has been solved in CI by introducing a 3-minute wait after the upgrade
of the director. It seems that if the agent DID NOT crash and restart within BOSH's
downtime, the connection is somehow left in a bad state. Cunnie suggested that the ARP
cache could be the reason the connection is not successfully re-established after the
director comes back up, since the new director IP may have a different MAC address
and the ARP table is only cleaned on agent restarts.

There is a PR to handle the above situation with some extra logging, but while doing
the research for it, I found that there are quite a few possible optimizations in
that area.

There are additional use cases where we might want more logic to react to certain
agent connection issues.

Notably:

- Exponential reconnect backoffs to put less pressure on a starting BOSH
- Better support for directors that are deployed with hostnames instead of static IPs.
  Instead of waiting for the DNS cache to expire on a connection failure, we could proactively
  start checking for new IPs and reconfigure the connection on the fly if necessary. That
  would also apply to updating the firewall rules for outgoing Nats traffic
- Avoid unnecessary agent restarts in some situations and increase resiliency
- Maybe other benefits

Ultimately it may be a good idea to move the retry logic for the Nats client away from the bosh native retry. My
impression is that, depending on the underlying client, requests have different flows that may not
be properly covered by the current use of bosh retryable to handle request failures. Additionally,
while connection failures are retried at the client level, requests are retried externally, which makes it harder
to debug failures or to handle issues more gracefully within the client itself.

If you've made it this far, thanks for reading and please leave feedback / suggestions.

Thanks in advance.

beyhan · 2022-04-28T14:27:46Z

beyhan
Apr 28, 2022
Maintainer

@nouseforaname most of the stuff will be implemented in cloudfoundry/bosh-agent#279. Is there still something open for discussion?


beyhan · 2022-05-06T14:38:04Z

beyhan
May 6, 2022
Maintainer

I had another look into cloudfoundry/bosh-agent#279. So what is still open after that one is merged IMHO is:

- Update the firewall rules for outgoing Nats traffic on reconnection
- Avoid unnecessary agent restarts in some situations and increase resiliency

@nouseforaname Where do you see improvement options regarding the hostname use case?

