
Attempt publication with circuit breaker #713

Closed · wants to merge 1 commit

Conversation

@masih (Member) commented Oct 9, 2024

GPBFT silently ignores failures in broadcast and rebroadcast requests, since handling them is beyond its boundaries of responsibility. In practice such failures may be a sign that pubsub is overwhelmed with messages, so ideally the system should avoid aggravating the situation by requesting further broadcasts. This is especially important for re-broadcast requests, because they often involve batch message publication.

The changes here wrap the pubsub publication calls with a circuit breaker that opens after a number of consecutive errors (set to 5) and does not attempt to publish further messages until a reset timeout (set to 3s) has passed.

Fixes #632
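
For illustration only, a minimal, self-contained sketch of a consecutive-failure circuit breaker like the one described above; the actual `internal/circuitbreaker` API in this PR may differ, and every name below (`CircuitBreaker`, `Run`, `ErrOpen`) is an assumption:

```go
// A minimal consecutive-failure circuit breaker, for illustration only.
// The PR's internal/circuitbreaker package may expose a different API.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open and calls are skipped.
var ErrOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu           sync.Mutex
	threshold    int           // consecutive failures before the breaker opens
	resetTimeout time.Duration // how long to stay open before allowing another attempt
	failures     int           // current run of consecutive failures
	openedAt     time.Time     // when the breaker last opened
}

func New(threshold int, resetTimeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{threshold: threshold, resetTimeout: resetTimeout}
}

// Run executes fn unless the breaker is open. Any success closes the
// breaker; reaching the failure threshold opens it until resetTimeout passes.
func (cb *CircuitBreaker) Run(fn func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.threshold && time.Since(cb.openedAt) < cb.resetTimeout {
		cb.mu.Unlock()
		return ErrOpen
	}
	cb.mu.Unlock()

	err := fn() // e.g. a pubsub publish call

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.openedAt = time.Now() // (re)open: skip calls until resetTimeout elapses
		}
		return err
	}
	cb.failures = 0
	return nil
}

func main() {
	cb := New(5, 3*time.Second)                                        // the defaults proposed in this PR
	publish := func() error { return errors.New("pubsub queue full") } // stand-in for a failing publish
	for i := 0; i < 7; i++ {
		fmt.Println(cb.Run(publish)) // first 5 calls fail, the next 2 return ErrOpen
	}
}
```

With the defaults above, five consecutive publish failures stop further attempts for three seconds, after which a single trial call decides whether the breaker re-opens.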

@@ -86,6 +88,7 @@ func newRunner(
ctxCancel: ctxCancel,
equivFilter: newEquivocationFilter(pID),
selfMessages: make(map[uint64]map[roundPhase][]*gpbft.GMessage),
cb: circuitbreaker.New(5, 3*time.Second),
@masih (Member Author) commented on this line:

Thoughts on defaults?

@masih requested review from Stebalien and Kubuxu on October 9, 2024 at 14:17

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 86.84211% with 5 lines in your changes missing coverage. Please review.

Project coverage is 76.28%. Comparing base (25c3ade) to head (f003b1f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| host.go | 66.66% | 1 Missing and 2 partials ⚠️ |
| internal/circuitbreaker/circuitbreaker.go | 93.10% | 2 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #713      +/-   ##
==========================================
- Coverage   76.28%   76.28%   -0.01%     
==========================================
  Files          69       70       +1     
  Lines        5584     5620      +36     
==========================================
+ Hits         4260     4287      +27     
- Misses        912      920       +8     
- Partials      412      413       +1     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/circuitbreaker/circuitbreaker.go | 93.10% <93.10%> (ø) |
| host.go | 66.28% <66.66%> (-0.52%) ⬇️ |

... and 1 file with indirect coverage changes

@Stebalien (Member) commented:

I'm not really sure how this will help. We'll drop messages either way; this just means we'll drop messages when we don't actually need to. Maybe it'll help, I just can't think of how.

My original thinking was to have some form of retry setup. E.g.:

  1. Put all messages in a queue.
  2. Try sending messages out of the queue, backing off and retrying as necessary.
  3. Drop all messages currently in the queue when the rebroadcast timer fires.
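
For illustration only, a minimal, self-contained sketch of the queue/backoff/drop approach described above; the queue size, backoff values, and all names are assumptions rather than code from this PR or from pubsub:

```go
// A sketch of the queue/backoff/drop approach: buffer outgoing messages,
// retry publication with backoff, and discard whatever is still queued
// when the rebroadcast timer fires.
package main

import (
	"context"
	"fmt"
	"time"
)

type broadcastQueue struct {
	pending chan []byte
	publish func(context.Context, []byte) error // stand-in for a pubsub publish call
}

func newBroadcastQueue(size int, publish func(context.Context, []byte) error) *broadcastQueue {
	return &broadcastQueue{pending: make(chan []byte, size), publish: publish}
}

// Enqueue adds a message for publication, dropping it if the queue is full
// so the backlog cannot grow indefinitely.
func (q *broadcastQueue) Enqueue(msg []byte) {
	select {
	case q.pending <- msg:
	default: // queue full: drop rather than block the caller
	}
}

// Reset discards everything currently queued, e.g. when the rebroadcast
// timer fires and the stale messages are about to be superseded anyway.
func (q *broadcastQueue) Reset() {
	for {
		select {
		case <-q.pending:
		default:
			return
		}
	}
}

// Drain publishes queued messages, backing off and retrying on error,
// until ctx is cancelled.
func (q *broadcastQueue) Drain(ctx context.Context) {
	backoff := 100 * time.Millisecond
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-q.pending:
			for {
				if err := q.publish(ctx, msg); err == nil {
					backoff = 100 * time.Millisecond // success: reset the backoff
					break
				}
				select {
				case <-ctx.Done():
					return
				case <-time.After(backoff):
				}
				if backoff *= 2; backoff > 2*time.Second {
					backoff = 2 * time.Second // cap the backoff
				}
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	q := newBroadcastQueue(64, func(_ context.Context, msg []byte) error {
		fmt.Printf("published %q\n", msg)
		return nil
	})
	go q.Drain(ctx)

	q.Enqueue([]byte("gpbft message"))
	time.Sleep(100 * time.Millisecond) // give the drain loop a moment to publish
	q.Reset()                          // e.g. on the rebroadcast timer firing
}
```

Bounding the queue and clearing it on Reset keeps the backlog from growing indefinitely, which is the concern raised later in the thread.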

@masih (Member Author) commented Oct 9, 2024

I wish this feedback had been added to the issue sooner, when it was captured.

Thanks for sharing. I'll take another shot at this.

@Stebalien (Member) commented:

Yeah, sorry, I saw that the issue was filed and didn't pay much attention. Also, I'm not 100% sure either way. I feel like we should do something about failures but it's unclear exactly what.

@masih (Member Author) commented Oct 9, 2024

Nah, I think what you suggested makes sense. For some reason I assumed retry was built into pubsub, and that if it errored, things were either unrecoverable or congested enough that the only thing that would help was to drop requests and retry conservatively. Otherwise, if the cause of the error is congestion, backoff alone could be worse, in that it could cause a prolonged slow publication flow.

I'd like to take a closer look at pubsub implementation for a more educated attempt.

@Stebalien (Member) commented:

IIRC, it's caused by the queue filling up. Which means the main issue is bursting, so backing off and retrying should help as long as we don't let our queue/retry grow indefinitely.

@masih (Member Author) commented Oct 10, 2024

Thanks Steven

@masih closed this Oct 10, 2024

Successfully merging this pull request may close these issues.

Consider backing off if RequestBroadcast errors