[CNS] add config snapshot event metrics at an interval #3140

jackieluc · 2024-11-12T22:44:54Z

Reason for Change:
Inspired by @ramiro-gamarra's PR to emit config snapshots at an interval, we should do the same for CNS so that both components offer a consistent view.

Issue Fixed:

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
relevant PR labels added

Notes:

jackieluc · 2024-11-12T22:45:49Z

/azp run Azure Container Networking PR

azure-pipelines · 2024-11-12T22:45:59Z

Azure Pipelines successfully started running 1 pipeline(s).

rbtr · 2024-11-12T23:20:24Z

cns/configuration/cns_config.json

@@ -5,6 +5,7 @@
        "HeartBeatIntervalInMins": 30,
        "RefreshIntervalInSecs": 15,
        "SnapshotIntervalInMins": 60,
+        "ConfigSnapshotIntervalInMins": 60,


CNS can't be reconfigured without being restarted. What do we gain by re-emitting the config at an interval?

Here are some of the motivations for emitting the config at an interval:

- easier troubleshooting in Kusto (and avoid going on Node) in case CNS logs are out of retention
- CNS' feature set is growing, so it would be nice to have more observability to
  - confirm feature flags
  - track the feature rollout in a dashboard

@rbtr If we only emit this during startup, any CNS process that outlives the retention policy in Kusto (~90 days) won't be shown in the snapshots. Constantly logging the config for multiple CNS nodes also gives us the ability to know which nodes in a cluster are/were active during a specific time period.

I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.

One of the usage scenarios for this is live site investigations where we want to determine the config with which CNS started. With that context, 60 minutes is not frequent.

rbtr · 2024-11-13T17:11:36Z

cns/configuration/cns_config.json

@@ -5,6 +5,7 @@
        "HeartBeatIntervalInMins": 30,
        "RefreshIntervalInSecs": 15,
        "SnapshotIntervalInMins": 60,
+        "ConfigSnapshotIntervalInMins": 60,


I am convinced that we should re-emit the config, but I don't think that we should use that for liveness. And if we're not using it for liveness, and the retention is 90 days, 60 minutes seems a bit frequent.

rbtr · 2024-11-13T17:15:10Z

cns/configuration/configuration.go

+	if telemetrySettings.ConfigSnapshotIntervalInMins == 0 {
+		telemetrySettings.ConfigSnapshotIntervalInMins = 60
+	}


should this be opt-in instead of no choice?

rbtr · 2024-11-13T17:19:02Z

cns/metric/configsnapshot.go

+	cs := md5.Sum(bb) //nolint:gosec // used for checksum
+	csStr := string(cs[:])


why do we need a checksum?

@timraymond suggested a checksum on the equivalent PR in dnc to easily detect a change in config for the same machine/deployment (the alternative being a visual scan of the entire config). dnc's config is larger than cns's and with more dynamic parts, but this might still be useful

Yeah, in theory you'd be able to filter lines where these get emitted and easily pinpoint when the config changed, which is probably of most interest. Doing that precise filtering would also probably filter out all of the CNS startup fanfare that happens in the logs too, so I think it remains useful even if the config is immutable while CNS is running.

we have a separate table for this type of telemetry (snapshots) so no need to worry about logs at least
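If the checksum stays, one thing to watch: `string(cs[:])` converts the 16 raw digest bytes directly, so the result may not be printable or easy to filter on. A rough sketch of a hex-encoded variant follows; `exampleConfig` and `configChecksum` are hypothetical names used for illustration only.

```go
package main

import (
	"crypto/md5" //nolint:gosec // checksum only, not a security boundary
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// exampleConfig is a hypothetical stand-in for the CNS config struct.
type exampleConfig struct {
	ChannelMode                  string
	ConfigSnapshotIntervalInMins int
}

// configChecksum marshals the config to JSON and returns a hex-encoded MD5
// digest, so identical configs map to the same printable string.
func configChecksum(cfg exampleConfig) (string, error) {
	bb, err := json.Marshal(cfg)
	if err != nil {
		return "", fmt.Errorf("marshalling config: %w", err)
	}
	sum := md5.Sum(bb)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := configChecksum(exampleConfig{ChannelMode: "Direct", ConfigSnapshotIntervalInMins: 60})
	b, _ := configChecksum(exampleConfig{ChannelMode: "Direct", ConfigSnapshotIntervalInMins: 30})
	fmt.Println("checksums differ:", a != b)
}
```

Filtering on this value would then surface the points in time where the config changed for a given machine.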

ramiro-gamarra · 2024-11-13T17:02:39Z

cns/metric/configsnapshot.go

+		select {
+		case <-ctx.Done():
+			return
+		case <-ticker.C:


with this loop, the first config snapshot is only emitted one full interval (one hour by default) after cns starts. you may want to emit one right away as well.

ramiro-gamarra · 2024-11-13T17:06:32Z

cns/metric/configsnapshot.go

+
+	event := aitelemetry.Event{
+		EventName:  logger.ConfigSnapshotMetricsStr,
+		ResourceID: csStr, // not guaranteed unique, instead use VM ID and Subscription to correlate


some guid generated at process start up might be more useful than a hash that multiple processes will share
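For illustration, a per-process identifier could be generated once at startup and reused for every snapshot event. The sketch below uses only the standard library to produce a random 128-bit hex string; `newProcessID` is a hypothetical helper, and a UUID library would work just as well.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newProcessID returns a random 128-bit identifier as a hex string.
// It is meant to be called once at process startup.
func newProcessID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("generating process id: %w", err)
	}
	return hex.EncodeToString(b), nil
}

func main() {
	id, err := newProcessID()
	if err != nil {
		panic(err)
	}
	fmt.Println("process id:", id)
}
```

Unlike the config hash, two processes with identical config would get different IDs, so a restart shows up as a new ID in the snapshot table.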

ramiro-gamarra · 2024-11-13T17:29:08Z

cns/metric/configsnapshot_test.go

+		},
+	}
+
+	assert.Equal(t, expected, event)


this test is a bit odd since it duplicates internal logic of (read: tightly couples) createCNSConfigSnapshotEvent. if you change certain things in the function, the test will need to change as well. IMO a more robust test might do something like

	event, err := createCNSConfigSnapshotEvent(config)
	require.NoError(t, err)
	assert.Equal(t, logger.ConfigSnapshotMetricsStr, event.EventName)
	assert.NotEmpty(t, event.ResourceID)
	// some assertion that the test contains a config property with valid json
	// some assertion that the config deserialized from those bytes matches your input config

let me know if this makes sense
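To make the last two assertions concrete, here is a self-contained sketch of the round-trip check. Everything in it is a hypothetical stand-in: `config`, `event`, the `"config"` property key, and `createSnapshotEvent` only mimic the shape of the real code so the example can run on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// config and event are hypothetical stand-ins, not the CNS or aitelemetry types.
type config struct {
	FeatureEnabled               bool
	ConfigSnapshotIntervalInMins int
}

type event struct {
	EventName  string
	Properties map[string]string
}

// createSnapshotEvent stands in for the function under test.
func createSnapshotEvent(cfg config) (event, error) {
	bb, err := json.Marshal(cfg)
	if err != nil {
		return event{}, fmt.Errorf("marshalling config: %w", err)
	}
	return event{
		EventName:  "ConfigSnapshot",
		Properties: map[string]string{"config": string(bb)},
	}, nil
}

// verifyRoundTrip checks observable behavior only: the event carries valid
// JSON that deserializes back to the input config.
func verifyRoundTrip(in config) error {
	ev, err := createSnapshotEvent(in)
	if err != nil {
		return err
	}
	raw, ok := ev.Properties["config"]
	if !ok {
		return fmt.Errorf("event has no config property")
	}
	var out config
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return fmt.Errorf("config property is not valid JSON: %w", err)
	}
	if !reflect.DeepEqual(in, out) {
		return fmt.Errorf("round trip mismatch: got %+v, want %+v", out, in)
	}
	return nil
}

func main() {
	err := verifyRoundTrip(config{FeatureEnabled: true, ConfigSnapshotIntervalInMins: 60})
	fmt.Println("round trip error:", err)
}
```

A check like this keeps passing if the checksum or event layout changes internally, as long as the emitted config still matches the input.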

jackieluc added 2 commits November 12, 2024 09:35

feat: add and expose new ConfigSnapshotIntervalInMins config

c9d4de6

feat: add interval event emitting of CNS config snapshot

37da1b3

jackieluc added the cns and telemetry labels Nov 12, 2024

jackieluc requested a review from a team as a code owner November 12, 2024 22:44

jackieluc requested a review from neaggarwMS November 12, 2024 22:44

Merge branch 'master' into jackieluc/config-snapshot-metric

6318f95

jackieluc changed the title ~~[CNS] add CNS config snapshot event metrics at an interval~~ [CNS] add config snapshot event metrics at an interval Nov 12, 2024

lint: address lint errors

fd88521

rbtr reviewed Nov 12, 2024

rbtr reviewed Nov 13, 2024

ramiro-gamarra reviewed Nov 13, 2024


msvik Nov 14, 2024


timraymond Nov 13, 2024
