[orchagent] RouteOrch cannot consume new routes if there are enough routes being tried in the m_toSync #3027

goomadao · 2024-01-29T03:45:21Z

Description

When many routes in the consumer.m_toSync of the ROUTE TABLE are being retried continuously (for example, because they are blocked by a non-existent neighbor), the Consumer is unable to fetch any new routes with pops() in Consumer::execute(). The number of retrying routes needed to trigger this issue depends on the shortest-interval Timer whose priority is higher than that of the ROUTE TABLE Consumer. The priority of the ROUTE TABLE Consumer is 5.

Steps to reproduce the issue

1. Distribute routes referencing NHG 5822, which does not exist or was deleted earlier
2. Deliver NHG 16518
3. Update all the routes to reference NHG 16518

Describe the results you received

The old routes are retried continuously and the new routes cannot be consumed. RouteOrch gets stuck here.

Describe the results you expected

New routes are able to be consumed and processed by route orch properly.

Output of show version

Output of show techsupport


Root cause of this issue

In OrchDaemon::start(), a Selectable is selected and its execute() function is called. After that, doTask() of every orch is triggered to retry all the remaining tasks. Therefore, if enough routes are being retried, and there is a Timer whose priority is higher than that of the ROUTE TABLE Consumer, and the interval of this Timer is shorter than the time one retry pass takes, the Timer is ready again by the next select and the ROUTE TABLE Consumer is never selected. In other words, new routes are never consumed.
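
The starvation condition can be illustrated with a small timing model. This is only a sketch, not orchagent code: `LoopModel`, `consumerSelections`, and the millisecond figures are hypothetical names and values chosen for illustration. It assumes that select always prefers a ready higher-priority Timer, that new routes are always waiting, and that each loop costs exactly one retry pass.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical timing model of the main loop; all times are in milliseconds.
struct LoopModel {
    int64_t timerIntervalMs;  // interval of the higher-priority Timer
    int64_t retryDurationMs;  // time one full retry pass over m_toSync takes
};

// Counts how often the lower-priority ROUTE TABLE consumer wins the select
// during `iterations` loops, assuming new routes are always pending.
int consumerSelections(const LoopModel &m, int iterations)
{
    assert(m.timerIntervalMs > 0 && m.retryDurationMs >= 0);
    int64_t now = 0;
    int64_t nextTimerFire = m.timerIntervalMs;
    int selected = 0;
    for (int i = 0; i < iterations; ++i) {
        if (now >= nextTimerFire) {
            // The Timer is ready and has higher priority, so it is selected.
            nextTimerFire = now + m.timerIntervalMs;
        } else {
            // Only when the Timer is not ready can the consumer be selected.
            ++selected;
        }
        // After execute(), the retry pass runs before the next select.
        now += m.retryDurationMs;
    }
    return selected;
}
```

In this model, a 50 ms Timer with a 60 ms retry pass lets the consumer run only in the very first loop (before the Timer has ever fired), while a 10 ms retry pass leaves most loops to the consumer.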

Additional information you deem important (e.g. issue happens only occasionally):

This was triggered occasionally in our testbed while BGP was flapping and some interfaces were being shut down and brought up. The fact that we have an additional Timer with a 50 ms interval may contribute to this issue.

Possible solution

Modify the retry mechanism. For example, perform the retry operation only every two loops. The change could also be limited to the route orch to narrow its scope of impact.
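
A rough sketch of why retrying only every Nth loop might help, using a hypothetical timing model rather than real orchagent code. `consumerSelectionsWithSparseRetry` and its parameters are made-up names; the model assumes a loop without a retry pass takes negligible time and that a ready higher-priority Timer always wins the select.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: the retry pass runs only on every `retryEvery`-th loop.
// Returns how often the ROUTE TABLE consumer is selected in `iterations` loops,
// assuming new routes are always pending. Times are in milliseconds.
int consumerSelectionsWithSparseRetry(int64_t timerIntervalMs,
                                      int64_t retryDurationMs,
                                      int retryEvery,
                                      int iterations)
{
    assert(timerIntervalMs > 0 && retryDurationMs >= 0 && retryEvery > 0);
    int64_t now = 0;
    int64_t nextTimerFire = timerIntervalMs;
    int selected = 0;
    for (int i = 0; i < iterations; ++i) {
        if (now >= nextTimerFire) {
            // The higher-priority Timer is ready and wins the select.
            nextTimerFire = now + timerIntervalMs;
        } else {
            ++selected;
        }
        // Loops that skip the retry pass are modeled as taking no time,
        // so the Timer cannot become ready again before the next select.
        if (i % retryEvery == 0) {
            now += retryDurationMs;
        }
    }
    return selected;
}
```

With a 50 ms Timer and a 60 ms retry pass, `retryEvery = 1` reproduces the starvation, whereas `retryEvery = 2` lets the consumer be selected in every other loop of this model.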


goomadao · 2024-01-29T10:24:58Z

Another problem is that the priority currently does not take effect: the effective priority of the ROUTE TABLE Consumer is 0, not 5 as defined. In this situation, the above issue does not occur.

To make the priority take effect, the following change can be applied.

--- a/orchagent/orch.h
+++ b/orchagent/orch.h
@@ -96,7 +96,8 @@ class Executor : public swss::Selectable
 {
 public:
     Executor(swss::Selectable *selectable, Orch *orch, const std::string &name)
-        : m_selectable(selectable)
+        : Selectable(selectable->getPri())
+        , m_selectable(selectable)
         , m_orch(orch)
         , m_name(name)
     {
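
The effect of the change above can be sketched with simplified stand-in classes. These are not the real swss types: `Selectable` here models only a priority with a default of 0, and `ExecutorBefore` / `ExecutorAfter` are hypothetical names for the wrapper without and with the patch.

```cpp
#include <cassert>

// Simplified stand-in for swss::Selectable: only the priority is modeled.
class Selectable {
public:
    explicit Selectable(int pri = 0) : m_priority(pri) {}
    virtual ~Selectable() = default;
    int getPri() const { return m_priority; }
private:
    int m_priority;
};

// Wrapper without the patch: the base class is default-constructed,
// so the wrapper reports priority 0 regardless of the wrapped object.
class ExecutorBefore : public Selectable {
public:
    explicit ExecutorBefore(Selectable *selectable) : m_selectable(selectable)
    {
        assert(selectable != nullptr);
    }
    Selectable *inner() const { return m_selectable; }
private:
    Selectable *m_selectable;
};

// Wrapper with the patch: the wrapped priority is forwarded to the base class.
class ExecutorAfter : public Selectable {
public:
    explicit ExecutorAfter(Selectable *selectable)
        : Selectable(selectable->getPri())
        , m_selectable(selectable)
    {
    }
    Selectable *inner() const { return m_selectable; }
private:
    Selectable *m_selectable;
};
```

If the select loop reads the priority from the wrapper rather than from the wrapped object, only the second variant would expose the configured value of 5.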
