Attempt publication with circuit breaker #713
Conversation
@@ -86,6 +88,7 @@ func newRunner(
 	ctxCancel: ctxCancel,
 	equivFilter: newEquivocationFilter(pID),
 	selfMessages: make(map[uint64]map[roundPhase][]*gpbft.GMessage),
+	cb: circuitbreaker.New(5, 3*time.Second),
Thoughts on defaults?
GPBFT silently ignores failures in broadcast and rebroadcast requests, since handling them is beyond its boundary of responsibility. In practice such failures may be a sign that pubsub is overwhelmed with messages, so ideally the system should avoid aggravating the situation by requesting further broadcasts. This is especially important for re-broadcast requests, which often involve batch message publication. The changes here wrap the pubsub publication calls with a circuit breaker that opens on consecutive errors (set to `5`) and will not attempt to publish messages again until a reset timeout (set to `3s`) has passed. Fixes #632
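For context, here is a minimal, self-contained sketch of the consecutive-failure circuit breaker pattern described above. It is illustrative only: the PR uses the repository's own `circuitbreaker` package, and the `breaker` type and `publish` function below are placeholders, not code from this change.

```go
// Minimal sketch of a consecutive-failure circuit breaker around a publish
// call. Illustrative only: the breaker type and publish function here are
// placeholders for the repository's circuitbreaker package and pubsub call.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errOpen = errors.New("circuit breaker open: dropping publication")

type breaker struct {
	maxFailures  int           // consecutive failures before the breaker opens (5 in the PR)
	resetTimeout time.Duration // how long to wait before trying again (3s in the PR)
	failures     int
	openedAt     time.Time
}

// Run executes f unless the breaker is open. Consecutive failures open the
// breaker; a success closes it again.
func (b *breaker) Run(f func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.resetTimeout {
		return errOpen
	}
	if err := f(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

// publish stands in for the pubsub topic publication call.
func publish(ctx context.Context, msg []byte) error { return nil }

func main() {
	b := &breaker{maxFailures: 5, resetTimeout: 3 * time.Second}
	err := b.Run(func() error {
		return publish(context.Background(), []byte("gpbft message"))
	})
	fmt.Println("publish attempt:", err)
}
```

A success closes the breaker immediately; once `maxFailures` consecutive errors accumulate, further publications are dropped until `resetTimeout` elapses, which is the behaviour the `5`/`3s` defaults are meant to tune.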
eddac73 to f003b1f (Compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff            @@
##             main     #713    +/-   ##
==========================================
- Coverage   76.28%   76.28%   -0.01%
==========================================
  Files          69       70       +1
  Lines        5584     5620      +36
==========================================
+ Hits         4260     4287      +27
- Misses        912      920       +8
- Partials      412      413       +1
I'm not really sure how this will help. We'll drop messages either way; this just means we'll drop messages when we don't actually need to. Maybe it'll help, I just can't think of how. My original thinking was to have some form of retry setup. E.g.:
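(The example referenced above was not captured in this thread. As a hedged illustration of the kind of retry setup being described, and not the original example, a bounded retry with exponential backoff around the publish call might look like the following; the function and parameter names are assumptions.)

```go
// Hypothetical retry-with-backoff around the publish call; not the example
// originally posted in this thread. Attempts are bounded so a congested
// pubsub queue is not hammered indefinitely.
package broadcast

import (
	"context"
	"time"
)

// publishWithRetry retries publish with exponential backoff, giving up after
// maxAttempts and returning the last error.
func publishWithRetry(
	ctx context.Context,
	publish func(context.Context, []byte) error,
	msg []byte,
	maxAttempts int,
) error {
	backoff := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = publish(ctx, msg); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // back off a little longer before each retry
		}
	}
	return err
}
```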
I wish this feedback had been added to the issue sooner, when it was captured. Thanks for sharing. I'll take another shot at this.
Yeah, sorry, I saw that the issue was filed and didn't pay much attention. Also, I'm not 100% sure either way. I feel like we should do something about failures, but it's unclear exactly what.
Nah, I think what you suggested makes sense. For some reason I assumed retry was built into pubsub, and that an error meant things were either not recoverable or congested enough that the only thing that would help is to drop requests and retry conservatively. Otherwise, if the cause of the error is congestion, backoff alone could be worse, in that it could cause a prolonged, slow publication flow. I'd like to take a closer look at the pubsub implementation for a more educated attempt.
IIRC, it's caused by the queue filling up, which means the main issue is bursting, so backing off and retrying should help as long as we don't let our retry queue grow indefinitely.
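To make the "don't let the retry queue grow indefinitely" point concrete, here is a hedged sketch of a fixed-capacity retry queue that drops messages once it is full; none of the names below come from the PR or from pubsub.

```go
// Sketch of a fixed-capacity retry queue: failed publications are buffered up
// to a cap and dropped when the queue is full, smoothing bursts without
// unbounded growth. All names are illustrative, not taken from the PR.
package broadcast

import (
	"context"
	"time"
)

// RetryQueue buffers failed publications for later retry.
type RetryQueue struct {
	queue   chan []byte
	publish func(context.Context, []byte) error
}

func NewRetryQueue(capacity int, publish func(context.Context, []byte) error) *RetryQueue {
	return &RetryQueue{queue: make(chan []byte, capacity), publish: publish}
}

// Enqueue records a failed publication for later retry. It returns false and
// drops the message if the queue is already full.
func (q *RetryQueue) Enqueue(msg []byte) bool {
	select {
	case q.queue <- msg:
		return true
	default:
		return false // full: drop rather than grow indefinitely
	}
}

// Run drains the queue at a fixed interval, re-enqueueing (subject to the same
// cap) any publication that fails again.
func (q *RetryQueue) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			select {
			case msg := <-q.queue:
				if err := q.publish(ctx, msg); err != nil {
					q.Enqueue(msg)
				}
			default: // nothing buffered this tick
			}
		}
	}
}
```

Dropping on a full queue trades completeness for bounded memory: bursts are smoothed, but messages that arrive while the queue is saturated are simply not retried.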
Thanks, Steven