Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UDP template expiry in collector process #381

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

antoninbas
Copy link
Member

Prior to this change, for every template received by the UDP collector, a new goroutine with a ticker was created, responsible for "expiring" the template after the proper timeout. After expiration, the ticker was stopped and te goroutine would complete (the ticker would only ever tick once, so really a timer should have been used?). If a client were to send a refreshed template, the old goroutine / ticker would not be stopped, and the new template would be deleted by the old goroutine, which is invalid. Additionally, starting a new goroutine every time a template message is received is a bit expensive.

With the new implementation, we use a timer created with time.AfterFunc. All the timers are handled by the go runtime with a single goroutine, making this very lightweight. The timers are stored as part of the template map. When a template is refreshed, we reset the existing timer. When a template is deleted, we stop the timer.

This is the first in a series of patch to improve template management in the collector process.

Prior to this change, for every template received by the UDP collector,
a new goroutine with a ticker was created, responsible for "expiring"
the template after the proper timeout. After expiration, the ticker was
stopped and te goroutine would complete (the ticker would only ever tick
once, so really a timer should have been used?). If a client were to
send a refreshed template, the old goroutine / ticker would not be
stopped, and the new template would be deleted by the old goroutine,
which is invalid. Additionally, starting a new goroutine every time a
template message is received is a bit expensive.

With the new implementation, we use a timer created with
time.AfterFunc. All the timers are handled by the go runtime with a
single goroutine, making this very lightweight. The timers are stored as
part of the template map. When a template is refreshed, we reset the
existing timer. When a template is deleted, we stop the timer.

This is the first in a series of patch to improve template management in
the collector process.

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 97.10145% with 2 lines in your changes missing coverage. Please review.

Project coverage is 67.53%. Comparing base (43c0400) to head (3bc36ea).

Files with missing lines Patch % Lines
pkg/collector/process.go 97.10% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #381      +/-   ##
==========================================
+ Coverage   67.12%   67.53%   +0.40%     
==========================================
  Files          25       25              
  Lines        4465     4512      +47     
==========================================
+ Hits         2997     3047      +50     
+ Misses       1273     1271       -2     
+ Partials      195      194       -1     
Flag Coverage Δ
integration-tests 51.52% <33.33%> (+1.04%) ⬆️
unit-tests 66.53% <97.10%> (+0.28%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/collector/process.go 89.49% <97.10%> (+0.72%) ⬆️

... and 1 file with indirect coverage changes

@@ -425,20 +426,49 @@ func TestCollectingProcess_DecodeDataRecord(t *testing.T) {
assert.NotNil(t, err, "Error should be logged for malformed data record")
}

// testClock is a wrapper around clocktesting.FakeClock. Unfortunately, FakeClock does not support
// calling clock.Now() when executing an AfterFunc() function, which is require in our case. So we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// calling clock.Now() when executing an AfterFunc() function, which is require in our case. So we
// calling clock.Now() when executing an AfterFunc() function, which is required in our case. So we

// scheduled for deletion by checking expiryTime. We cannot just
// automatically delete the template.

// No reason to try to stop the timer in this case, even though it would be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to follow the logic, is it because the expiryTimer is already nil, so there's no point to set stopTimer to true? Because I was thinking about L444, wondering if there's a case when stopTimer is true but expiryTimer is nil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code executes as part of the function invoked by the timer. Broadly speaking, if the function has been invoked by the timer, then the timer has already fired and there is no need to stop it. There are some edge cases with AfterFunc which are described in the documentation snippet above.
Essentially there are 2 cases:

  • the timer fires because we need to expire the template. This is the "normal" case. The timer has fired and there is no need to stop it.
  • the timer fires, but there is a concurrent call to addTemplate (and to AfterFunc). This is the edge case described in the doc. We rely on the condition below (!tpl.expiryTime.After(now)) to determine whether we should just ignore this firing of the timer. The synchronization provided by cp.mutex is very important here. We will wait until the call to addTemplate returns and releases the lock. If the condition is false (which will be the case), we will not call expiryTimer.Stop() in deleteTemplateWithConds (even if stopTimer was set to true), as the template entry has been updated and the current run of the timer function should be ignored and be a no-op.

Calling expiryTimer.Stop() even if the timer has already been stopped is harmless. But it doesn't work well with the fake clock unfortunately. If the fake clock implementation is fixed, I may remove stopTimer.

The nil check at line L444 is more for defensive programming and for the TCP case.

func (cp *CollectingProcess) deleteTemplateWithConds(obsDomainID uint32, templateID uint16, stopTimer bool, condFns ...func(*template) bool) bool {
cp.mutex.Lock()
defer cp.mutex.Unlock()
template, ok := cp.templatesMap[obsDomainID][templateID]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the templateID be deleted before, and the timer cannot be stopped when we can't find the templateID in templatesMap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which case / code path are you referring to?
the only case where stopTimer should be false is if the caller knows for certain than the timer has already expired (taking into account that condFns should all be true).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, when we remove the templateID from a templatesMap, shouldn't we call the expiryTimer.Stop()? otherwise; wouldn't expiryTimer keep running?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also remind me where we call deleteTemplate? If we call deleteTemplate after a template is expired, deleteTemplate will return false and the expiryTimer. Stop () won't be called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, when we remove the templateID from a templatesMap, shouldn't we call the expiryTimer.Stop()? otherwise; wouldn't expiryTimer keep running?

We do this a few lines below:

	if stopTimer && template.expiryTimer != nil {
		template.expiryTimer.Stop()
	}

Could you also remind me where we call deleteTemplate? If we call deleteTemplate after a template is expired, deleteTemplate will return false and the expiryTimer. Stop () won't be called?

deleteTemplate is not called right now actually. I have a follow up patch which deletes templates when TCP connections are closed, but I wanted to split PRs to keep them small. Sorry for the confusion. At the moment we call deleteTemplateWithConds when expiring a template. During this call, we do not need to Stop the timer, as we are guaranteed that the timer has already fired (see comments in addTemplate).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. So stopTimer is mainly for TCP when we want to delete the template before it is expired.

yuntanghsu
yuntanghsu previously approved these changes Nov 20, 2024
Copy link
Contributor

@yuntanghsu yuntanghsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

The fake clock provided as part of k8s.io/utils has some limitations: it
does not conform to the specifications of the standard library time
package when it comes to timers (e.g., Stop() does not return false if
the timer has already been stopped), and it prevents (deadlock) clock
methods such as Now() (but also the Stop() method of timers) from being
called while executing a timer function (timer created with AfterFunc).

To avoid non-intuitive workarounds in our code caused by these
limitations, we define our own clock interface and fake clock
implementation. This does not represent too much code, as the
CollectingProcess only uses a limited amount of clock functions.

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
@antoninbas
Copy link
Member Author

@heanlan @yuntanghsu Because there was some legitimate confusion around the use of stopTimer and the correctness of the code, I have added a second commit which removes this parameter. The only reason why I added stopTimer was because of limitations in the fake clock implementation provided by k8s.io/utils. But it's usually not a good idea to modify logic to accommodate unit tests, even though the code was still correct. In the new commit, I no longer use the fake clock provided by k8s.io/utils, and I define a custom clock interface which satisfies the needs of the CollectingProcess, along with a corresponding fake clock implementation for the interface, which can be used in unit tests. PTAL, hopefully the code will make more sense now. More details in the commit message for the new commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants