-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CNS] add config snapshot event metrics at an interval #3140
base: master
Are you sure you want to change the base?
[CNS] add config snapshot event metrics at an interval #3140
Conversation
/azp run Azure Container Networking PR |
Azure Pipelines successfully started running 1 pipeline(s). |
@@ -5,6 +5,7 @@ | |||
"HeartBeatIntervalInMins": 30, | |||
"RefreshIntervalInSecs": 15, | |||
"SnapshotIntervalInMins": 60, | |||
"ConfigSnapshotIntervalInMins": 60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CNS can't be reconfigured without being restarted. What do we gain by re-emitting the config at an interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some of the motivations for emitting the config at an interval:
- easier troubleshooting in Kusto (and avoid going on Node) in case CNS logs are out of retention
- CNS' feature set is growing, so it would be nice to have more observability to
- confirm feature flags
- track the feature rollout in a dashboard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rbtr If we only emit this during start up, any CNS process that outlives the retention policy in Kusto (~90 days) won't be shown in the snapshots. Some constant logging of the config for multiple CNS nodes also gives us the ability to know what nodes in a cluster are/were active during a specific time period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the usage scenarios for this is live site investigations where we want to determine the config with which CNS started. With that context, 60 minutes is not frequent.
@@ -5,6 +5,7 @@ | |||
"HeartBeatIntervalInMins": 30, | |||
"RefreshIntervalInSecs": 15, | |||
"SnapshotIntervalInMins": 60, | |||
"ConfigSnapshotIntervalInMins": 60, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.
if telemetrySettings.ConfigSnapshotIntervalInMins == 0 { | ||
telemetrySettings.ConfigSnapshotIntervalInMins = 60 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be opt-in instead of no choice?
cs := md5.Sum(bb) //nolint:gosec // used for checksum | ||
csStr := string(cs[:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a checksum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timraymond suggested a checksum on the equivalent PR in dnc to easily detect a change in config for the same machine/deployment (the alternative being a visual scan of the entire config). dnc's config is larger than cns's and with more dynamic parts, but this might still be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in theory you'd be able to filter lines where these get emitted and easily pinpoint when the config changed, which is probably of most interest. Doing that precise filtering would also probably filter out all of the CNS startup fanfare that happens in the logs too, so I think it remains useful even if the config is immutable while CNS is running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a separate table for this type of telemetry (snapshots) so no need to worry about logs at least
select { | ||
case <-ctx.Done(): | ||
return | ||
case <-ticker.C: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will only start emitting the first config snapshot one hour after cns starts. you may want to emit this right away as well.
|
||
event := aitelemetry.Event{ | ||
EventName: logger.ConfigSnapshotMetricsStr, | ||
ResourceID: csStr, // not guaranteed unique, instead use VM ID and Subscription to correlate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some guid generated at process start up might be more useful than a hash that multiple processes will share
}, | ||
} | ||
|
||
assert.Equal(t, expected, event) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is a bit odd since it duplicates internal logic of (read: tightly couples) createCNSConfigSnapshotEvent
. if you change certain things in the function, the test will need to change as well. IMO a more robust test might do something like
event, err := createCNSConfigSnapshotEvent(config)
require.NoError(t, err)
assert.Equal(t, logger.ConfigSnapshotMetricsStr, event.EventName)
assert.NotEmpty(t, event.ResourceID)
// some assertion that the test contains a config property with valid json
// some assertion that the config deserialized from those bytes matches your input config
let me know if this makes sense
Reason for Change:
Inspired by @ramiro-gamarra's PR to emit config snapshots at an interval, we should do the same for CNS for a consistent view for both components.
Issue Fixed:
Requirements:
Notes: