We run fluentd (which uses serverengine) in a container. Sometimes the workers keep dying in a tight loop, which puts a lot of stress on the system, but this is not visible from the outside because the server process just keeps restarting the workers.
I'd like to make a PR to add a `max_crash_frequency` (or similarly named) option that would crash the server if the worker crash frequency goes above a certain value (for example, 10 per minute).
```ruby
# frozen_string_literal: true
# changes serverengine/lib/serverengine/multi_process_server.rb to crash the server when workers fail too often
#
# https://github.com/treasure-data/serverengine/issues/96
module PreventWorkerCrashloop
  MAX_WORKER_CRASHES = 5
  MAX_WORKER_CRASH_INTERVAL = 5 * 60

  def alive?
    alive = super
    if !alive && !@unrecoverable_exit
      now = Time.now.to_i
      cutoff = now - MAX_WORKER_CRASH_INTERVAL
      failures = (@@failures_timestamps ||= []) # rubocop:disable Style/ClassVars
      failures.reject! { |t| t < cutoff }
      failures << now
      if failures.size >= MAX_WORKER_CRASHES
        diff = now - failures.first
        @worker.logger.error("PreventWorkerCrashloop killing server because of #{failures.size} worker crashes in #{diff}s")
        @unrecoverable_exit = true
      end
    end
    alive
  end
end

ServerEngine::MultiProcessServer::WorkerMonitor.prepend(PreventWorkerCrashloop)
```
/cc @repeatedly @tagomoris
The `PreventWorkerCrashloop` monkeypatch above is our current workaround.
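For the PR itself, the sliding-window check could be pulled out of `alive?` into a small standalone object with configurable limits. The following is only a rough sketch of that idea: `CrashFrequencyLimiter`, `max_crashes`, `interval`, and `record_crash` are hypothetical names and not part of ServerEngine's API, and how the result would be wired into the server shutdown path is left open.

```ruby
# frozen_string_literal: true

# Hypothetical helper, NOT part of ServerEngine: keeps crash timestamps in a
# sliding window and reports when the configured limit has been reached.
class CrashFrequencyLimiter
  # max_crashes: number of crashes within the window that trips the limiter
  # interval:    window length in seconds
  def initialize(max_crashes:, interval:)
    @max_crashes = max_crashes
    @interval = interval
    @timestamps = []
  end

  # Records one crash at time `now` (in seconds) and returns true when the
  # number of crashes inside the window has reached max_crashes.
  # The default uses a monotonic clock so wall-clock adjustments do not
  # distort the window.
  def record_crash(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    cutoff = now - @interval
    @timestamps.reject! { |t| t < cutoff } # drop crashes older than the window
    @timestamps << now
    @timestamps.size >= @max_crashes
  end
end

# Usage with explicit timestamps: three crashes within 60 seconds trip the limiter.
limiter = CrashFrequencyLimiter.new(max_crashes: 3, interval: 60)
[0, 10, 20].each do |t|
  puts "crash at #{t}s -> limit reached: #{limiter.record_crash(t)}"
end
```

Keeping the state in an instance rather than in a class variable would also avoid the `Style/ClassVars` exception the monkeypatch needs, and it lets the limit be tested without starting any workers.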