
[cleaner] Optimize file obfuscation with set-based lookup and pre-filters#4286

Open
gotoweb wants to merge 4 commits into sosreport:main from gotoweb:faster-combined
Conversation


@gotoweb gotoweb commented Apr 10, 2026

Reopening this PR; the previous PR #4281 was closed while resolving conflicts and recreating the branch.

Relevant: #4227


Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname email@example.com?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?

gotoweb added 2 commits April 10, 2026 14:11
…ters

Replace the linear scan over all compiled_regexes with a targeted
lookup that only processes items actually present in each line.

- Set-based token lookup for username and keyword maps:
  SoSMap gains a use_token_lookup mode (enabled for SoSUsernameMap and
  SoSKeywordMap). Lines are split into lowercase tokens and checked
  against a set of known items using O(1) set intersection. Items
  containing non-alphanumeric characters (e.g. 'systemd-network') fall
  back to substring matching. The new get_matched_items() method
  returns only the items found in the line, rather than the full
  compiled_regexes list.

- Targeted iteration in parsers:
  _parse_line_with_compiled_regexes() and parse_string_for_keys() now
  call mapping.get_matched_items(line) instead of checking
  compiled_search then iterating all compiled_regexes. For non-token
  maps (IP, hostname), the original compiled_search pre-filter is
  preserved.

- Deferred regex compilation:
  generate_item_regexes() sets mapping.initializing = True while
  adding items in bulk, then calls generate_compiled_regexes() once
  at the end, avoiding redundant recompilation per add_regex_item().

- Cheap pre-filters for MAC and IPv6 parsers:
  Added lightweight _quick_check regexes that short-circuit
  _parse_line() for lines that cannot contain MAC or IPv6 addresses,
  avoiding the expensive findall() on the vast majority of lines.
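The set-based lookup from the first bullet can be sketched roughly as follows. This is an illustration only: the class name, internal structures, and tokenization here are assumptions mirroring the commit message, not the actual SoSMap implementation in sos/cleaner/mappings/__init__.py.

```python
import re

class TokenLookupMap:
    """Illustrative sketch of a token-lookup mapping (not the sos code)."""

    def __init__(self, items):
        self._regex_dict = {}        # lowercased item -> compiled regex
        self._simple_tokens = set()  # purely alphanumeric items
        self._complex_items = []     # e.g. 'systemd-network'
        for item in items:
            key = item.lower()
            self._regex_dict[key] = re.compile(
                r'\b%s\b' % re.escape(item), re.I)
            if item.isalnum():
                self._simple_tokens.add(key)
            else:
                self._complex_items.append(key)

    def get_matched_items(self, line):
        """Yield only the (item, regex) pairs actually present in line."""
        tokens = set(re.split(r'\W+', line.lower()))
        # O(1)-per-item membership check via set intersection
        for item in self._simple_tokens & tokens:
            yield item, self._regex_dict[item]
        # items with non-alphanumeric characters fall back to
        # substring matching, since tokenization would split them
        for item in self._complex_items:
            if item in line.lower():
                yield item, self._regex_dict[item]
```

The key point is that lines containing none of the mapped items cost one split plus one set intersection, instead of a regex search per item.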

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
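The pre-filter idea from the last bullet can be sketched like this; the QUICK_MAC pattern is a hypothetical example of a cheap screen, not the actual _quick_check regex used in sos.

```python
import re

# Cheap screen: any MAC address must contain at least one adjacent
# colon-separated hex pair, which most lines do not.
QUICK_MAC = re.compile(r'[0-9a-f]{2}:[0-9a-f]{2}', re.I)
# Expensive full pattern, only run when the quick check passes.
FULL_MAC = re.compile(r'(?:[0-9a-f]{2}:){5}[0-9a-f]{2}', re.I)

def find_macs(line):
    # short-circuit: no hex pair means no MAC can be present
    if not QUICK_MAC.search(line):
        return []
    return FULL_MAC.findall(line)
```

A line failing the quick check never reaches the full findall(), which is where the savings described above come from.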
Previously, add_regex_item() and generate_compiled_regexes()
always populated both the compiled_regexes list and _regex_dict
regardless of whether token-based lookup was enabled. This meant
maintaining redundant data structures that were never accessed
by the active code path.

Separate the data structure management based on the
use_token_lookup flag so that each path only populates the
structures it actually uses:

- Token lookup path: _regex_dict, _simple_tokens, _complex_items
- Traditional path: compiled_regexes list, compiled_search regex

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
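The separation described above can be sketched as follows. Names such as add_regex_item() mirror the sos code, but the bodies here are assumptions for illustration only.

```python
import re

class SoSMapSketch:
    """Illustrative split of the two code paths on use_token_lookup."""

    def __init__(self, use_token_lookup=False):
        self.use_token_lookup = use_token_lookup
        # token-lookup path structures
        self._regex_dict = {}
        self._simple_tokens = set()
        self._complex_items = []
        # traditional path structures
        self.compiled_regexes = []
        self.compiled_search = None

    def add_regex_item(self, item):
        reg = re.compile(r'\b%s\b' % re.escape(item), re.I)
        if self.use_token_lookup:
            # populate only the structures the token path reads
            self._regex_dict[item] = reg
            if item.isalnum():
                self._simple_tokens.add(item.lower())
            else:
                self._complex_items.append(item)
        else:
            # traditional path: per-item regex list plus one combined
            # pre-filter regex (rebuilt here naively for the sketch)
            self.compiled_regexes.append((item, reg))
            self.compiled_search = re.compile(
                '|'.join(re.escape(i) for i, _ in self.compiled_regexes),
                re.I)
```

Each path now pays only for the structures it actually uses, rather than maintaining both unconditionally.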
@packit-as-a-service

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo dnf install -y 'dnf*-command(copr)'
  • dnf copr enable packit/sosreport-sos-4286
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@jcastill jcastill added the Kind/cleaner cleaner component of sos label Apr 10, 2026
@pmoravec
Contributor

> resolving conflicts and recreating the branch.

You can also rebase your branch onto current upstream; there are multiple ways to achieve that. Anyway, whatever works, works for me/us (including opening a new PR) :). Reviewing the code now and running the same perf comparison (preliminary results are better than before).

Comment thread sos/cleaner/mappings/__init__.py
Comment thread sos/cleaner/parsers/__init__.py
Comment thread sos/cleaner/parsers/__init__.py Outdated
line, _count = reg.subn(self.mapping.get(item), line)
count += _count
# break the cycle if no further search can apply
if not self.mapping.compiled_search.search(line):
Contributor


So here you need self.mapping.compiled_search even for use_token_lookup = True?

This brings me to counterexample:

mkdir sosreport
echo "foo bar foo bar qux" > sosreport/test.txt
rm -rf /etc/sos/cleaner/* /var/tmp/sos*
python3 bin/sos clean sosreport --batch --keywords foo,bar,qux
cat /var/tmp/sosreport

Then I randomly get

obfuscatedword0 bar obfuscatedword0 bar qux

or

foo obfuscatedword1 foo obfuscatedword1 qux

depending what is the first pair in for item, reg in self.mapping.get_matched_items(line): loop.

This condition in if not self.mapping.compiled_search.search(line): must be conditional depending on use_token_lookup value.

Author


Thanks for the review and the counterexample; it made the bug very clear.
As you suggested, I fixed this by skipping the early-exit for token-lookup parsers:

gotoweb@7e762c0

Token-lookup parsers don't maintain compiled_search, so the check was always True after the first substitution, causing the loop to break prematurely. With this guard, token-lookup parsers rely on get_matched_items() exhausting its results naturally.

I also added regression tests for multi-keyword and multi-username obfuscation to tests/unittests/cleaner_tests.py.

Comment thread sos/cleaner/parsers/ipv6_parser.py
For token-lookup maps (keyword, username), compiled_search is never
updated and always fails to match, causing the early-exit in
_parse_line_with_compiled_regexes() to fire after the first
substitution. Guard the check with use_token_lookup so token-lookup
parsers exhaust get_matched_items() naturally.

Add regression tests for multi-keyword and multi-username obfuscation.

Assisted-by: Claude Sonnet <https://claude.ai>
Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
Comment thread sos/cleaner/parsers/__init__.py Outdated
if not self.mapping.compiled_search.search(line):
    break
for item, reg in self.mapping.get_matched_items(line):
    if reg.search(line):
Contributor

@pmoravec pmoravec Apr 12, 2026


Just a thought: for use_token_lookup=True, we know all (item, reg) pairs will be found in the line, so this search is redundant?

Is it safe to have here

            if self.mapping.use_token_lookup or reg.search(line):

?

(Apologies for the iterative feedback; I should have raised this question in earlier rounds.)

Author


Thanks for the suggestion. You're right: for token-lookup maps, get_matched_items() already confirms presence via set intersection and substring matching, so reg.search() is redundant here.

Applied if self.mapping.use_token_lookup or reg.search(line): in both _parse_line_with_compiled_regexes() and parse_string_for_keys().
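The agreed guard can be demonstrated with a small standalone sketch; substitute_matched() is a hypothetical helper, not the actual sos method.

```python
import re

def substitute_matched(use_token_lookup, matched_items, get_sub, line):
    """Apply substitutions for pre-matched (item, regex) pairs.

    For token-lookup maps, presence was already confirmed upstream,
    so the per-item reg.search() is short-circuited away.
    """
    count = 0
    for item, reg in matched_items:
        if use_token_lookup or reg.search(line):
            line, _count = reg.subn(get_sub(item), line)
            count += _count
    return line, count

# usage: with token lookup on, a pre-checked pair is substituted
# without the extra search pass
pairs = [('foo', re.compile(r'\bfoo\b'))]
line, n = substitute_matched(True, pairs, lambda i: 'obfuscatedword0',
                             'foo stays foo')
```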

Comment thread sos/cleaner/parsers/__init__.py Outdated
if self.compile_regexes:
-    for item, reg in self.mapping.compiled_regexes:
+    for item, reg in self.mapping.get_matched_items(string_data):
         if reg.search(string_data):
Contributor


Ditto for the self.mapping.use_token_lookup or .. sub-condition.

Author


Done. Same pattern applied here as well.

@pmoravec
Contributor

Nice improvement overall. I'm leaving one decision to @TurboTurtle plus one improvement in two places to the author; other than that, I will ACK it.

As a rough rule of thumb, the performance gain will be 5-6% overall (per my benchmark; better results on systems with, e.g., many users).

For token-lookup maps (keyword, username), get_matched_items()
already guarantees that returned items exist in the line via set
intersection and substring matching. The subsequent reg.search()
call is therefore redundant and can be skipped.

Guard the reg.search() check with use_token_lookup in both
_parse_line_with_compiled_regexes() and parse_string_for_keys()
so that token-lookup parsers proceed directly to substitution.

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
Contributor

@pmoravec pmoravec left a comment


No further comments from my side. Thanks for the improvement, and for bearing with the iterative feedback from my side.

Comparing performance on my test suite, I see a 5-10% improvement (it depends on multiple factors; the bigger the sosreport or the default mapping, the bigger the improvement, even in relative terms). Nice!

Let's wait for a second approval ACK now.

@pmoravec pmoravec added Reviewed/Needs 2nd Ack Require a 2nd ack from a maintainer Status/Needs Review This issue still needs a review from project members labels Apr 15, 2026