
Attempt publication with circuit breaker #713

Closed · wants to merge 1 commit

Conversation

@masih (Member) commented Oct 9, 2024

GPBFT silently ignores failures in broadcast and rebroadcast requests, since handling them is beyond its boundaries of responsibility. In practice such failures may be a sign that pubsub is overwhelmed with messages, so ideally the system should avoid aggravating the situation by requesting further broadcasts. This is especially important for re-broadcast requests, because they often involve batch message publication.

The changes here wrap the pubsub publication calls with a circuit breaker that opens after a number of consecutive errors (set to 5) and does not attempt to publish further messages until a reset timeout (set to 3s) has passed.

Fixes #632
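
For illustration only, a minimal, self-contained sketch of a consecutive-failure circuit breaker like the one described above; the actual `internal/circuitbreaker` API in this PR may differ, and every name below (`CircuitBreaker`, `Run`, `ErrOpen`) is an assumption:

```go
// A minimal consecutive-failure circuit breaker, for illustration only.
// The PR's internal/circuitbreaker package may expose a different API.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open and calls are skipped.
var ErrOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu           sync.Mutex
	threshold    int           // consecutive failures before the breaker opens
	resetTimeout time.Duration // how long to stay open before allowing another attempt
	failures     int           // current run of consecutive failures
	openedAt     time.Time     // when the breaker last opened
}

func New(threshold int, resetTimeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{threshold: threshold, resetTimeout: resetTimeout}
}

// Run executes fn unless the breaker is open. Any success closes the
// breaker; reaching the failure threshold opens it until resetTimeout passes.
func (cb *CircuitBreaker) Run(fn func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.threshold && time.Since(cb.openedAt) < cb.resetTimeout {
		cb.mu.Unlock()
		return ErrOpen
	}
	cb.mu.Unlock()

	err := fn() // e.g. a pubsub publish call

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.openedAt = time.Now() // (re)open: skip calls until resetTimeout elapses
		}
		return err
	}
	cb.failures = 0
	return nil
}

func main() {
	cb := New(5, 3*time.Second)                                        // the defaults proposed in this PR
	publish := func() error { return errors.New("pubsub queue full") } // stand-in for a failing publish
	for i := 0; i < 7; i++ {
		fmt.Println(cb.Run(publish)) // first 5 calls fail, the next 2 return ErrOpen
	}
}
```

With the defaults above, five consecutive publish failures stop further attempts for three seconds, after which a single trial call decides whether the breaker re-opens.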

@@ -86,6 +88,7 @@ func newRunner(
ctxCancel: ctxCancel,
equivFilter: newEquivocationFilter(pID),
selfMessages: make(map[uint64]map[roundPhase][]*gpbft.GMessage),
cb: circuitbreaker.New(5, 3*time.Second),
@masih (Member Author) commented on this line:

Thoughts on defaults?

@masih requested review from Stebalien and Kubuxu on October 9, 2024 at 14:17

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 86.84211% with 5 lines in your changes missing coverage. Please review.

Project coverage is 76.28%. Comparing base (25c3ade) to head (f003b1f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| host.go | 66.66% | 1 Missing and 2 partials ⚠️ |
| internal/circuitbreaker/circuitbreaker.go | 93.10% | 2 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main     #713      +/-   ##
==========================================
- Coverage   76.28%   76.28%   -0.01%     
==========================================
  Files          69       70       +1     
  Lines        5584     5620      +36     
==========================================
+ Hits         4260     4287      +27     
- Misses        912      920       +8     
- Partials      412      413       +1     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/circuitbreaker/circuitbreaker.go | 93.10% <93.10%> (ø) |
| host.go | 66.28% <66.66%> (-0.52%) ⬇️ |

... and 1 file with indirect coverage changes

@Stebalien (Member) commented:

I'm not really sure how this will help. We'll drop messages either way; this just means we'll drop messages when we don't actually need to. Maybe it'll help, I just can't think of how.

My original thinking was to have some form of retry setup. E.g.:

  1. Put all messages in a queue.
  2. Try sending messages out of the queue, backing off and retrying as necessary.
  3. Drop all messages currently in the queue when the rebroadcast timer fires.
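
For illustration only, a minimal, self-contained sketch of the queue/backoff/drop approach described above; the queue size, backoff values, and all names are assumptions rather than code from this PR or from pubsub:

```go
// A sketch of the queue/backoff/drop approach: buffer outgoing messages,
// retry publication with backoff, and discard whatever is still queued
// when the rebroadcast timer fires.
package main

import (
	"context"
	"fmt"
	"time"
)

type broadcastQueue struct {
	pending chan []byte
	publish func(context.Context, []byte) error // stand-in for a pubsub publish call
}

func newBroadcastQueue(size int, publish func(context.Context, []byte) error) *broadcastQueue {
	return &broadcastQueue{pending: make(chan []byte, size), publish: publish}
}

// Enqueue adds a message for publication, dropping it if the queue is full
// so the backlog cannot grow indefinitely.
func (q *broadcastQueue) Enqueue(msg []byte) {
	select {
	case q.pending <- msg:
	default: // queue full: drop rather than block the caller
	}
}

// Reset discards everything currently queued, e.g. when the rebroadcast
// timer fires and the stale messages are about to be superseded anyway.
func (q *broadcastQueue) Reset() {
	for {
		select {
		case <-q.pending:
		default:
			return
		}
	}
}

// Drain publishes queued messages, backing off and retrying on error,
// until ctx is cancelled.
func (q *broadcastQueue) Drain(ctx context.Context) {
	backoff := 100 * time.Millisecond
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-q.pending:
			for {
				if err := q.publish(ctx, msg); err == nil {
					backoff = 100 * time.Millisecond // success: reset the backoff
					break
				}
				select {
				case <-ctx.Done():
					return
				case <-time.After(backoff):
				}
				if backoff *= 2; backoff > 2*time.Second {
					backoff = 2 * time.Second // cap the backoff
				}
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	q := newBroadcastQueue(64, func(_ context.Context, msg []byte) error {
		fmt.Printf("published %q\n", msg)
		return nil
	})
	go q.Drain(ctx)

	q.Enqueue([]byte("gpbft message"))
	time.Sleep(100 * time.Millisecond) // give the drain loop a moment to publish
	q.Reset()                          // e.g. on the rebroadcast timer firing
}
```

Bounding the queue and clearing it on Reset keeps the backlog from growing indefinitely, which is the concern raised later in the thread.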

@masih (Member Author) commented Oct 9, 2024

I wish this feedback had been added to the issue sooner, when it was captured.

Thanks for sharing. I'll take another shot at this.

@Stebalien (Member) commented:

Yeah, sorry, I saw that the issue was filed and didn't pay much attention. Also, I'm not 100% sure either way. I feel like we should do something about failures but it's unclear exactly what.

@masih (Member Author) commented Oct 9, 2024

Nah, I think what you suggested makes sense. For some reason I assumed retry was built into pubsub, and that if it errored, things were either unrecoverable or congested enough that the only thing that would help was to drop requests and retry conservatively. Otherwise, if the cause of the error is congestion, backoff alone could be worse, in that it could cause a prolonged slow publication flow.

I'd like to take a closer look at pubsub implementation for a more educated attempt.

@Stebalien (Member) commented:

IIRC, it's caused by the queue filling up. Which means the main issue is bursting, so backing off and retrying should help as long as we don't let our queue/retry grow indefinitely.

@masih (Member Author) commented Oct 10, 2024

Thanks Steven

@masih closed this Oct 10, 2024

Successfully merging this pull request may close these issues.

Consider backing off if RequestBroadcast errors