Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CNS] add config snapshot event metrics at an interval #3140

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jackieluc
Copy link
Member

@jackieluc jackieluc commented Nov 12, 2024

Reason for Change:
Inspired by @ramiro-gamarra's PR to emit config snapshots at an interval, we should do the same for CNS for a consistent view for both components.

Issue Fixed:

Requirements:

Notes:

@jackieluc jackieluc added cns Related to CNS. telemetry labels Nov 12, 2024
@jackieluc jackieluc requested a review from a team as a code owner November 12, 2024 22:44
@jackieluc
Copy link
Member Author

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jackieluc jackieluc changed the title [CNS] add CNS config snapshot event metrics at an interval [CNS] add config snapshot event metrics at an interval Nov 12, 2024
@@ -5,6 +5,7 @@
"HeartBeatIntervalInMins": 30,
"RefreshIntervalInSecs": 15,
"SnapshotIntervalInMins": 60,
"ConfigSnapshotIntervalInMins": 60,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CNS can't be reconfigured without being restarted. What do we gain by re-emitting the config at an interval?

Copy link
Member Author

@jackieluc jackieluc Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some of the motivations for emitting the config at an interval:

  • easier troubleshooting in Kusto (and avoid going on Node) in case CNS logs are out of retention
  • CNS' feature set is growing, so it would be nice to have more observability to
    • confirm feature flags
    • track the feature rollout in a dashboard

Copy link
Contributor

@ramiro-gamarra ramiro-gamarra Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rbtr If we only emit this during start up, any CNS process that outlives the retention policy in Kusto (~90 days) won't be shown in the snapshots. Some constant logging of the config for multiple CNS nodes also gives us the ability to know what nodes in a cluster are/were active during a specific time period.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the usage scenarios for this is live site investigations where we want to determine the config with which CNS started. With that context, 60 minutes is not frequent.

@@ -5,6 +5,7 @@
"HeartBeatIntervalInMins": 30,
"RefreshIntervalInSecs": 15,
"SnapshotIntervalInMins": 60,
"ConfigSnapshotIntervalInMins": 60,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.

Comment on lines +182 to +184
if telemetrySettings.ConfigSnapshotIntervalInMins == 0 {
telemetrySettings.ConfigSnapshotIntervalInMins = 60
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be opt-in instead of no choice?

Comment on lines +45 to +46
cs := md5.Sum(bb) //nolint:gosec // used for checksum
csStr := string(cs[:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a checksum?

Copy link
Contributor

@ramiro-gamarra ramiro-gamarra Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timraymond suggested a checksum on the equivalent PR in dnc to easily detect a change in config for the same machine/deployment (the alternative being a visual scan of the entire config). dnc's config is larger than cns's and with more dynamic parts, but this might still be useful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in theory you'd be able to filter lines where these get emitted and easily pinpoint when the config changed, which is probably of most interest. Doing that precise filtering would also probably filter out all of the CNS startup fanfare that happens in the logs too, so I think it remains useful even if the config is immutable while CNS is running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a separate table for this type of telemetry (snapshots) so no need to worry about logs at least

select {
case <-ctx.Done():
return
case <-ticker.C:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will only start emitting the first config snapshot one hour after cns starts. you may want to emit this right away as well.


event := aitelemetry.Event{
EventName: logger.ConfigSnapshotMetricsStr,
ResourceID: csStr, // not guaranteed unique, instead use VM ID and Subscription to correlate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some guid generated at process start up might be more useful than a hash that multiple processes will share

},
}

assert.Equal(t, expected, event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is a bit odd since it duplicates internal logic of (read: tightly couples) createCNSConfigSnapshotEvent. if you change certain things in the function, the test will need to change as well. IMO a more robust test might do something like

event, err := createCNSConfigSnapshotEvent(config)
require.NoError(t, err)

assert.Equal(t, logger.ConfigSnapshotMetricsStr, event.EventName)
assert.NotEmpty(t, event.ResourceID)
// some assertion that the test contains a config property with valid json
// some assertion that the config deserialized from those bytes matches your input config

let me know if this makes sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cns Related to CNS. telemetry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants