🐛 BUG: unable to find host with relay error after a relayed client is restarted #1241

Open
theblop opened this issue Oct 10, 2024 · 2 comments


theblop commented Oct 10, 2024

What version of nebula are you using? (nebula -version)

1.9.4

What operating system are you using?

Linux

Describe the Bug

setup

I have a nebula client configured with 3 relays (which are also lighthouses but I don't think it matters) to connect to the mesh. Other clients don't need the relays:

client (relayed, in private subnet) -> relays (lighthouses) -> (MESH on public internet) <- other clients (non-relayed)

problem

When the relayed client first registers with the relays, everything works fine. But when the client restarts and reconnects to the relays, some non-relayed clients can't connect back to it until I either restart them one by one, or restart at least one of the relays (which fixes all the problematic non-relayed clients in one go).

  • The relayed client is in a private subnet.
  • The lighthouses/relays are exposed to the internet with 1:1 NAT.
  • We don't have full control over the network infra of the non-relayed clients: they are on-prem at various customers' locations. We asked them to open and forward the nebula port, but I think some may still have problematic NAT (that's why we have punchy: true).
  • Some non-relayed clients are not affected by this issue (they reconnect immediately to the restarted relayed client), but since we don't have any control over their network infra, it's hard to tell the difference between them and the other clients.

Note: traffic between all the non-relayed clients is never affected.

fix (workaround)

Either:

  • restart the non-relayed clients (only fixes the connection from each restarted client to the relayed client),
  • or restart at least one lighthouse (fixes all clients' connections to the relayed client in one go); see the sketch below.
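
For reference, on systemd-based hosts the workaround amounts to something like this (a sketch; the unit name `nebula` is an assumption and may differ per install):

```bash
# On one affected non-relayed client (only fixes that client's
# tunnel to the relayed client):
sudo systemctl restart nebula

# Or on any one of the lighthouses/relays (fixes all affected
# non-relayed clients in one go):
sudo systemctl restart nebula
```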

Logs from affected hosts

Some anonymized logs from when the error happens (the timestamps don't match exactly, but the same messages loop forever anyway):

lighthouses (relays) logs:

time="2024-10-09T13:55:22Z" level=info msg="Failed to find target host info by ip" certName=otherclient1.mesh error="unable to find host with relay" localIndex=80854473 relayTo=100.96.2.13 remoteIndex=3176915269 vpnIp=100.99.63.1
...

(This message is quickly repeated for all the other clients and loops forever; 100.96.2.13 is the relayed client.)
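
When the lighthouse is stuck in this loop, nebula's debug sshd can snapshot its relay state. A sketch, assuming the lighthouse config enables the `sshd:` block (listening on 127.0.0.1:2222 here with an authorized user named `debug`), and assuming your build includes the `print-relays` command:

```bash
# Dump the lighthouse's view of its relay state:
ssh -p 2222 debug@127.0.0.1 print-relays

# Inspect the tunnel the lighthouse holds toward the relayed client:
ssh -p 2222 debug@127.0.0.1 print-tunnel 100.96.2.13
```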

client (relayed) logs:

time="2024-10-09T13:54:35Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2192126849 localIndex=2192126849 remoteIndex=0 udpAddrs="[X:X:X:X:4242 10.10.1.23:4242]" vpnIp=100.99.63.1
time="2024-10-09T13:54:41Z" level=info msg="Handshake timed out" durationNs=6073492769 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2192126849 localIndex=2192126849 remoteIndex=0 udpAddrs="[X:X:X:X:4242 10.10.1.23:4242]" vpnIp=100.99.63.1
...

(loops forever)

other clients (not relayed) logs:

time="2024-10-09T15:55:22+02:00" level=info msg="Attempt to relay through hosts" localIndex=4278002031 relays="[100.96.0.1 100.96.0.2 100.96.0.3 100.96.0.1 100.96.0.2 100.96.0.3 100.96.0.1 100.96.0.2 100.96.0.3]" remoteIndex=0 vpnIp=100.96.2.13
time="2024-10-09T15:55:22+02:00" level=info msg="Send handshake via relay" localIndex=4278002031 relay=100.96.0.1 remoteIndex=0 vpnIp=100.96.2.13
time="2024-10-09T15:55:23+02:00" level=info msg="Handshake timed out" durationNs=3420931786 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=4278002031 localIndex=4278002031 remoteIndex=0 udpAddrs="[10.88.37.91:4242]" vpnIp=100.96.2.13

(loops forever)
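
To check whether these relayed handshake packets actually leave the non-relayed client (and whether anything comes back), a capture on the nebula UDP port can help. A sketch; 4242 is the port used in this setup, and the interface name `eth0` is an assumption:

```bash
# Watch nebula UDP traffic between this client and the relays:
sudo tcpdump -ni eth0 udp port 4242
```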

Config files from affected hosts

lighthouses (relays) config:

static_host_map:
lighthouse:
  am_lighthouse: true
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: true

client (relayed) config:

static_host_map:
  100.96.0.1:
    - lh1.example.org:4242
  100.96.0.2:
    - lh2.example.org:4242
  100.96.0.3:
    - lh3.example.org:4242
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - 100.96.0.1
    - 100.96.0.2
    - 100.96.0.3
relay:
  am_relay: false
  use_relays: true
  relays:
    - 100.96.0.1
    - 100.96.0.2
    - 100.96.0.3
punchy:
  punch: true

other clients (not relayed) config:

static_host_map:
  100.96.0.1:
    - lh1.example.org:4242
  100.96.0.2:
    - lh2.example.org:4242
  100.96.0.3:
    - lh3.example.org:4242
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "100.96.0.3"
    - "100.96.0.2"
    - "100.96.0.1"
relay:
  am_relay: false
  use_relays: true
punchy:
  punch: true
  respond: true
@johnmaguire added the NeedsInvestigation label on Oct 11, 2024
brad-defined (Collaborator) commented

Thanks for the detailed bug report, @theblop. Unfortunately, I haven't been able to find the root cause after some code spelunking. Do you mind sharing more complete logs? As much as you can share from the hosts involved would help me understand more about how this relay connection broke, and how it's failing to recover.

johnmaguire (Collaborator) commented

Most likely this is solved by #1270.
