[cleaner] Optimize file obfuscation with set-based lookup and pre-filters #4286
gotoweb wants to merge 4 commits into sosreport:main
Conversation
Replace the linear scan over all compiled_regexes with a targeted lookup that only processes items actually present in each line.

- Set-based token lookup for username and keyword maps: SoSMap gains a use_token_lookup mode (enabled for SoSUsernameMap and SoSKeywordMap). Lines are split into lowercase tokens and checked against a set of known items using O(1) set intersection. Items containing non-alphanumeric characters (e.g. 'systemd-network') fall back to substring matching. The new get_matched_items() method returns only the items found in the line, rather than the full compiled_regexes list.
- Targeted iteration in parsers: _parse_line_with_compiled_regexes() and parse_string_for_keys() now call mapping.get_matched_items(line) instead of checking compiled_search and then iterating all compiled_regexes. For non-token maps (IP, hostname), the original compiled_search pre-filter is preserved.
- Deferred regex compilation: generate_item_regexes() sets mapping.initializing = True while adding items in bulk, then calls generate_compiled_regexes() once at the end, avoiding redundant recompilation per add_regex_item().
- Cheap pre-filters for MAC and IPv6 parsers: lightweight _quick_check regexes short-circuit _parse_line() for lines that cannot contain MAC or IPv6 addresses, avoiding the expensive findall() on the vast majority of lines.

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
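The token-lookup idea described above can be sketched as follows. This is a simplified illustration, not the actual sos code: the class name TokenLookupMap and the add_item() helper are hypothetical; only the get_matched_items() name and the split between set-intersection tokens and substring fallback mirror the description.

```python
import re

class TokenLookupMap:
    """Simplified sketch of a map with set-based token lookup.

    Purely alphanumeric items go into a token set checked via O(1) set
    intersection; anything else (e.g. 'systemd-network') falls back to
    substring matching.
    """

    def __init__(self):
        self._regex_dict = {}       # item -> compiled regex
        self._simple_tokens = set() # alphanumeric items, lowercased
        self._complex_items = []    # items needing substring matching

    def add_item(self, item):
        item = item.lower()
        self._regex_dict[item] = re.compile(re.escape(item), re.IGNORECASE)
        if item.isalnum():
            self._simple_tokens.add(item)
        else:
            self._complex_items.append(item)

    def get_matched_items(self, line):
        """Return only the (item, regex) pairs actually present in line."""
        lowered = line.lower()
        tokens = set(re.split(r'\W+', lowered))
        matched = [(i, self._regex_dict[i])
                   for i in tokens & self._simple_tokens]
        matched.extend((i, self._regex_dict[i])
                       for i in self._complex_items if i in lowered)
        return matched

m = TokenLookupMap()
for kw in ('foo', 'bar', 'systemd-network'):
    m.add_item(kw)
found = sorted(item for item, _ in m.get_matched_items(
    'foo talks to systemd-network daily'))
print(found)  # ['foo', 'systemd-network']
```

Note how 'bar' is never touched: the parser only iterates items that are actually present, instead of testing every compiled regex against every line.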
Previously, add_regex_item() and generate_compiled_regexes() always populated both the compiled_regexes list and _regex_dict regardless of whether token-based lookup was enabled. This meant maintaining redundant data structures that were never accessed by the active code path.

Separate the data structure management based on the use_token_lookup flag so that each path only populates the structures it actually uses:

- Token lookup path: _regex_dict, _simple_tokens, _complex_items
- Traditional path: compiled_regexes list, compiled_search regex

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
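The split can be illustrated with a minimal sketch. MapSketch is a hypothetical stand-in for the real SoSMap; only the branching on use_token_lookup and the structure names mirror the change described above.

```python
import re

class MapSketch:
    """Illustrative sketch: each lookup mode populates only its own
    data structures, instead of maintaining both unconditionally."""

    def __init__(self, use_token_lookup=False):
        self.use_token_lookup = use_token_lookup
        # token-lookup structures
        self._regex_dict = {}
        self._simple_tokens = set()
        self._complex_items = []
        # traditional structures
        self.compiled_regexes = []
        self.compiled_search = None

    def add_regex_item(self, item):
        reg = re.compile(re.escape(item), re.IGNORECASE)
        if self.use_token_lookup:
            self._regex_dict[item] = reg
            if item.isalnum():
                self._simple_tokens.add(item.lower())
            else:
                self._complex_items.append(item)
        else:
            self.compiled_regexes.append((item, reg))
            joined = '|'.join(re.escape(i) for i, _ in self.compiled_regexes)
            self.compiled_search = re.compile(joined, re.IGNORECASE)

token_map = MapSketch(use_token_lookup=True)
token_map.add_regex_item('alice')
legacy_map = MapSketch()
legacy_map.add_regex_item('alice')
print(len(token_map.compiled_regexes), len(legacy_map.compiled_regexes))
# prints: 0 1
```

The token-lookup map never builds compiled_regexes or compiled_search, which is exactly why the early-exit check discussed later in this conversation had to be made conditional.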
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:

Please note that the RPMs should be used only in a testing environment.
You can also rebase your branch onto current upstream; there are multiple ways to achieve that. Anyway, whatever works for me/us (including opening a new PR) :). Reviewing the code now and running the same performance comparison (preliminary results are better than before).
    line, _count = reg.subn(self.mapping.get(item), line)
    count += _count
    # break the cycle if no further search can apply
    if not self.mapping.compiled_search.search(line):
So here you need self.mapping.compiled_search even for use_token_lookup = True?
This brings me to a counterexample:
mkdir sosreport
echo "foo bar foo bar qux" > sosreport/test.txt
rm -rf /etc/sos/cleaner/* /var/tmp/sos*
python3 bin/sos clean sosreport --batch --keywords foo,bar,qux
cat /var/tmp/sosreport
Then I randomly get
obfuscatedword0 bar obfuscatedword0 bar qux
or
foo obfuscatedword1 foo obfuscatedword1 qux
depending on which pair comes first in the for item, reg in self.mapping.get_matched_items(line): loop.
The condition if not self.mapping.compiled_search.search(line): must be made conditional on the use_token_lookup value.
Thanks for the review and the counterexample; it made the bug very clear.
As you suggested, I fixed this by skipping the early-exit for token-lookup parsers.
Token-lookup parsers don't maintain compiled_search, so the check was always True after the first substitution, causing the loop to break prematurely. With this guard, token-lookup parsers rely on get_matched_items() exhausting its results naturally.
I also added regression tests for multi-keyword and multi-username obfuscation to tests/unittests/cleaner_tests.py.
For token-lookup maps (keyword, username), compiled_search is never updated and always fails to match, causing the early-exit in _parse_line_with_compiled_regexes() to fire after the first substitution. Guard the check with use_token_lookup so token-lookup parsers exhaust get_matched_items() naturally.

Add regression tests for multi-keyword and multi-username obfuscation.

Assisted-by: Claude Sonnet <https://claude.ai>
Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
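The guarded early-exit amounts to roughly the following sketch. The obfuscate_line() function and the StubTokenMap class are hypothetical simplifications; only the combined `not use_token_lookup and not compiled_search.search(...)` condition mirrors the actual fix.

```python
import re

def obfuscate_line(line, mapping):
    """Sketch of the substitution loop with the guarded early-exit."""
    count = 0
    for item, reg in mapping.get_matched_items(line):
        line, _count = reg.subn(mapping.get(item), line)
        count += _count
        # Only traditional maps maintain compiled_search; for
        # token-lookup maps the early-exit must be skipped, or the
        # loop would break after the first substitution.
        if (not mapping.use_token_lookup
                and not mapping.compiled_search.search(line)):
            break
    return line, count

class StubTokenMap:
    """Hypothetical stand-in for a token-lookup map."""
    use_token_lookup = True

    def __init__(self, subs):
        self.subs = subs

    def get(self, item):
        return self.subs[item]

    def get_matched_items(self, line):
        return [(i, re.compile(re.escape(i)))
                for i in self.subs if i in line]

m = StubTokenMap({'foo': 'obfuscatedword0', 'bar': 'obfuscatedword1'})
# both keywords from the counterexample are now obfuscated
print(obfuscate_line('foo bar foo bar qux', m))
```

Without the use_token_lookup guard, a stub whose compiled_search never matches would break out after replacing only 'foo', reproducing the partial obfuscation shown in the counterexample.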
    if not self.mapping.compiled_search.search(line):
        break
    for item, reg in self.mapping.get_matched_items(line):
        if reg.search(line):
Just a thought: for use_token_lookup=True, we know all (item, reg) pairs returned will be present in the line, so this search is redundant, isn't it?
Is it safe to have here:
if self.mapping.use_token_lookup or reg.search(line):
?
(Apologies for such iterative feedback; I should have raised this question in previous rounds already.)
Thanks for the suggestion. You're right: for token-lookup maps, get_matched_items() already confirms presence via set intersection and substring matching, so reg.search() is redundant here.
Applied if self.mapping.use_token_lookup or reg.search(line): in both _parse_line_with_compiled_regexes() and parse_string_for_keys().
    if self.compile_regexes:
-       for item, reg in self.mapping.compiled_regexes:
+       for item, reg in self.mapping.get_matched_items(string_data):
            if reg.search(string_data):
Ditto for self.mapping.use_token_lookup or .. sub-condition.
Done, the same pattern is applied here as well.
Nice improvement overall. Leaving one decision to @TurboTurtle plus one improvement in two places to the author; other than that, I will ACK it. As a rough rule of thumb, the performance gain is 5-6% overall (per my benchmark; better results on systems with e.g. many users).
For token-lookup maps (keyword, username), get_matched_items() already guarantees that returned items exist in the line via set intersection and substring matching. The subsequent reg.search() call is therefore redundant and can be skipped.

Guard the reg.search() check with use_token_lookup in both _parse_line_with_compiled_regexes() and parse_string_for_keys() so that token-lookup parsers proceed directly to substitution.

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
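Applied to parse_string_for_keys(), the guard looks roughly like the sketch below. The function body and the StubMap class are hypothetical simplifications; only the combined `use_token_lookup or reg.search(...)` condition mirrors the change.

```python
import re

def parse_string_for_keys_sketch(string_data, mapping):
    """Sketch: skip the redundant reg.search() for token-lookup maps,
    whose get_matched_items() already guarantees presence."""
    for item, reg in mapping.get_matched_items(string_data):
        if mapping.use_token_lookup or reg.search(string_data):
            string_data = reg.sub(mapping.get(item), string_data)
    return string_data

class StubMap:
    """Hypothetical stand-in exposing the interface used above."""

    def __init__(self, subs, use_token_lookup):
        self.subs = subs
        self.use_token_lookup = use_token_lookup

    def get(self, item):
        return self.subs[item]

    def get_matched_items(self, data):
        return [(i, re.compile(re.escape(i)))
                for i in self.subs if i in data]

m = StubMap({'qux': 'obfuscatedword2'}, use_token_lookup=True)
print(parse_string_for_keys_sketch('foo qux', m))  # foo obfuscatedword2
```

For a traditional map (use_token_lookup=False) the behavior is unchanged: reg.search() still runs before every substitution.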
pmoravec left a comment
No further comments from my side. Thanks for the improvement, and for bearing with the iterative feedback from my side.
Comparing performance on my test suite, I see a 5-10% improvement (it depends on multiple factors: the bigger the sosreport or the default mapping, the bigger the (even relative) improvement). Nice!
Let's wait for a second approval ACK now.
Reopening this PR as the previous PR #4281 was closed during the process of resolving conflicts and recreating the branch.
Relevant: #4227
Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines