Skip to content

Fixes #1852 - Unmap addresses from routers that become unreachable in topology changes.#1854

Open
ted-ross wants to merge 9 commits intoskupperproject:mainfrom
ted-ross:tross-1852
Open

Fixes #1852 - Unmap addresses from routers that become unreachable in topology changes.#1854
ted-ross wants to merge 9 commits intoskupperproject:mainfrom
ted-ross:tross-1852

Conversation

@ted-ross
Copy link
Copy Markdown
Member

NOTE: This is a draft PR. I'm putting this here as a suggested way to solve the problem in the raised issue.

This needs more testing and examination before it can be accepted.

@ted-ross ted-ross requested a review from gabordozsa February 27, 2026 18:58
@github-advanced-security
Copy link
Copy Markdown

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@gabordozsa
Copy link
Copy Markdown
Collaborator

@ted-ross Would it make sense to unmap the addresses in two phases in order to give a chance to a lost link to recover? I.e. mark the unreachable nodes first and call unmap a few ticks later if they are still unreachable. I'm thinking of a router losing the access to a large number of other routers due to a peer is being restarted.

@ted-ross
Copy link
Copy Markdown
Member Author

ted-ross commented Mar 2, 2026

@ted-ross Would it make sense to unmap the addresses in two phases in order to give a chance to a lost link to recover? I.e. mark the unreachable nodes first and call unmap a few ticks later if they are still unreachable. I'm thinking of a router losing the access to a large number of other routers due to a peer is being restarted.

This is a good suggestion. My main concern with this PR is that it might cause a lot of churn if there was a momentary loss of connection. This is why the original functionality was in place, but 60 seconds is arguably much too long of an interval to allow disconnects to get fixed.

@ted-ross
Copy link
Copy Markdown
Member Author

ted-ross commented Mar 2, 2026

Further to @gabordozsa 's suggestion...

We could add a new configuration interval called removeUnreachableHoldoffSeconds (or similar) to indicate how many seconds the router should wait after declaring a peer-router "unreachable" until its addresses are flushed. A good default might be 1 or 2 seconds, but it could be increased for larger networks or networks with many addresses.

… of addresses on a remote router that becomes unreachable.
Comment thread python/skupper_router_internal/router/node.py Outdated
@ted-ross
Copy link
Copy Markdown
Member Author

This PR needs a test to validate the new functionality.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.2%. Comparing base (60cd6ec) to head (79b7c2f).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1854     +/-   ##
=======================================
- Coverage   70.5%   67.2%   -3.4%     
=======================================
  Files        209     204      -5     
  Lines      47838   47716    -122     
  Branches    5156    5326    +170     
=======================================
- Hits       33765   32081   -1684     
- Misses     11608   13405   +1797     
+ Partials    2465    2230    -235     
Flag Coverage Δ
pysystemtests 77.3% <ø> (+0.9%) ⬆️
systemtests 61.9% <ø> (-5.6%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
unittests 67.2% <84.5%> (-3.4%) ⬇️
systemtests 67.2% <84.5%> (-3.4%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member Author

@ted-ross ted-ross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to the PR are very good. I inserted a couple of questions for discussion before we approve this.

Comment thread python/skupper_router_internal/router/node.py Outdated
Comment thread python/skupper_router_internal/router/node.py Outdated
Comment thread python/skupper_router_internal/router/node.py
@ted-ross ted-ross marked this pull request as ready for review April 14, 2026 14:46
@gabordozsa gabordozsa force-pushed the tross-1852 branch 2 times, most recently from 24017bc to 03f061b Compare April 15, 2026 20:41
- Create a common base class for the non/TLS and TLS tests in order to avoid launching
  duplicate router and grpc server instances for the TLS tests

- Disable fork support for grpcio in order to stabilize the test on CentOS 9 (python 3.9)
- pip cannot upgrade typing_extensions if an older version is installed from deb package
- make 'sudo' pick-up the PIP_BREAK_SYSTEM_PACKAGES env var
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants