[cleaner] Optimize file obfuscation with set-based lookup and pre-filters #4286
gotoweb wants to merge 4 commits into sosreport:main
Conversation
Replace the linear scan over all compiled_regexes with a targeted lookup that only processes items actually present in each line.

- Set-based token lookup for username and keyword maps: SoSMap gains a use_token_lookup mode (enabled for SoSUsernameMap and SoSKeywordMap). Lines are split into lowercase tokens and checked against a set of known items using O(1) set intersection. Items containing non-alphanumeric characters (e.g. 'systemd-network') fall back to substring matching. The new get_matched_items() method returns only the items found in the line, rather than the full compiled_regexes list.
- Targeted iteration in parsers: _parse_line_with_compiled_regexes() and parse_string_for_keys() now call mapping.get_matched_items(line) instead of checking compiled_search and then iterating all compiled_regexes. For non-token maps (IP, hostname), the original compiled_search pre-filter is preserved.
- Deferred regex compilation: generate_item_regexes() sets mapping.initializing = True while adding items in bulk, then calls generate_compiled_regexes() once at the end, avoiding redundant recompilation per add_regex_item().
- Cheap pre-filters for MAC and IPv6 parsers: lightweight _quick_check regexes short-circuit _parse_line() for lines that cannot contain MAC or IPv6 addresses, avoiding the expensive findall() on the vast majority of lines.

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
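The token-lookup idea described above can be sketched as follows. This is a simplified illustration, not the actual sos code: the class name TokenLookupMap and the add_item() helper are hypothetical; only the get_matched_items() name and the split between set-intersection tokens and substring fallback mirror the description.

```python
import re

class TokenLookupMap:
    """Simplified sketch of a map with set-based token lookup.

    Purely alphanumeric items go into a token set checked via O(1) set
    intersection; anything else (e.g. 'systemd-network') falls back to
    substring matching.
    """

    def __init__(self):
        self._regex_dict = {}       # item -> compiled regex
        self._simple_tokens = set() # alphanumeric items, lowercased
        self._complex_items = []    # items needing substring matching

    def add_item(self, item):
        item = item.lower()
        self._regex_dict[item] = re.compile(re.escape(item), re.IGNORECASE)
        if item.isalnum():
            self._simple_tokens.add(item)
        else:
            self._complex_items.append(item)

    def get_matched_items(self, line):
        """Return only the (item, regex) pairs actually present in line."""
        lowered = line.lower()
        tokens = set(re.split(r'\W+', lowered))
        matched = [(i, self._regex_dict[i])
                   for i in tokens & self._simple_tokens]
        matched.extend((i, self._regex_dict[i])
                       for i in self._complex_items if i in lowered)
        return matched

m = TokenLookupMap()
for kw in ('foo', 'bar', 'systemd-network'):
    m.add_item(kw)
found = sorted(item for item, _ in m.get_matched_items(
    'foo talks to systemd-network daily'))
print(found)  # ['foo', 'systemd-network']
```

Note how 'bar' is never touched: the parser only iterates items that are actually present, instead of testing every compiled regex against every line.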
Previously, add_regex_item() and generate_compiled_regexes() always populated both the compiled_regexes list and _regex_dict regardless of whether token-based lookup was enabled. This meant maintaining redundant data structures that were never accessed by the active code path.

Separate the data structure management based on the use_token_lookup flag so that each path only populates the structures it actually uses:

- Token lookup path: _regex_dict, _simple_tokens, _complex_items
- Traditional path: compiled_regexes list, compiled_search regex

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
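The split can be illustrated with a minimal sketch. MapSketch is a hypothetical stand-in for the real SoSMap; only the branching on use_token_lookup and the structure names mirror the change described above.

```python
import re

class MapSketch:
    """Illustrative sketch: each lookup mode populates only its own
    data structures, instead of maintaining both unconditionally."""

    def __init__(self, use_token_lookup=False):
        self.use_token_lookup = use_token_lookup
        # token-lookup structures
        self._regex_dict = {}
        self._simple_tokens = set()
        self._complex_items = []
        # traditional structures
        self.compiled_regexes = []
        self.compiled_search = None

    def add_regex_item(self, item):
        reg = re.compile(re.escape(item), re.IGNORECASE)
        if self.use_token_lookup:
            self._regex_dict[item] = reg
            if item.isalnum():
                self._simple_tokens.add(item.lower())
            else:
                self._complex_items.append(item)
        else:
            self.compiled_regexes.append((item, reg))
            joined = '|'.join(re.escape(i) for i, _ in self.compiled_regexes)
            self.compiled_search = re.compile(joined, re.IGNORECASE)

token_map = MapSketch(use_token_lookup=True)
token_map.add_regex_item('alice')
legacy_map = MapSketch()
legacy_map.add_regex_item('alice')
print(len(token_map.compiled_regexes), len(legacy_map.compiled_regexes))
# prints: 0 1
```

The token-lookup map never builds compiled_regexes or compiled_search, which is exactly why the early-exit check discussed later in this conversation had to be made conditional.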
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:

Please note that the RPMs should be used only in a testing environment.
You can also rebase your branch onto current upstream; there are multiple ways to achieve that. Anyway, whatever works for me/us (including opening a new PR) :). Reviewing the code now and running the same performance comparison (preliminary results are better than before).
    line, _count = reg.subn(self.mapping.get(item), line)
    count += _count
    # break the cycle if no further search can apply
    if not self.mapping.compiled_search.search(line):
So here you need self.mapping.compiled_search even for use_token_lookup = True?
This brings me to a counterexample:
mkdir sosreport
echo "foo bar foo bar qux" > sosreport/test.txt
rm -rf /etc/sos/cleaner/* /var/tmp/sos*
python3 bin/sos clean sosreport --batch --keywords foo,bar,qux
cat /var/tmp/sosreport
Then I randomly get
obfuscatedword0 bar obfuscatedword0 bar qux
or
foo obfuscatedword1 foo obfuscatedword1 qux
depending on which pair comes first in the for item, reg in self.mapping.get_matched_items(line): loop.
The condition if not self.mapping.compiled_search.search(line): must be made conditional on the use_token_lookup value.
Thanks for the review and the counterexample; it made the bug very clear.
As you suggested, I fixed this by skipping the early-exit for token-lookup parsers.
Token-lookup parsers don't maintain compiled_search, so the check was always True after the first substitution, causing the loop to break prematurely. With this guard, token-lookup parsers rely on get_matched_items() exhausting its results naturally.
I also added regression tests for multi-keyword and multi-username obfuscation to tests/unittests/cleaner_tests.py.
For token-lookup maps (keyword, username), compiled_search is never updated and always fails to match, causing the early-exit in _parse_line_with_compiled_regexes() to fire after the first substitution. Guard the check with use_token_lookup so token-lookup parsers exhaust get_matched_items() naturally.

Add regression tests for multi-keyword and multi-username obfuscation.

Assisted-by: Claude Sonnet <https://claude.ai>
Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
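The guarded early-exit amounts to roughly the following sketch. The obfuscate_line() function and the StubTokenMap class are hypothetical simplifications; only the combined `not use_token_lookup and not compiled_search.search(...)` condition mirrors the actual fix.

```python
import re

def obfuscate_line(line, mapping):
    """Sketch of the substitution loop with the guarded early-exit."""
    count = 0
    for item, reg in mapping.get_matched_items(line):
        line, _count = reg.subn(mapping.get(item), line)
        count += _count
        # Only traditional maps maintain compiled_search; for
        # token-lookup maps the early-exit must be skipped, or the
        # loop would break after the first substitution.
        if (not mapping.use_token_lookup
                and not mapping.compiled_search.search(line)):
            break
    return line, count

class StubTokenMap:
    """Hypothetical stand-in for a token-lookup map."""
    use_token_lookup = True

    def __init__(self, subs):
        self.subs = subs

    def get(self, item):
        return self.subs[item]

    def get_matched_items(self, line):
        return [(i, re.compile(re.escape(i)))
                for i in self.subs if i in line]

m = StubTokenMap({'foo': 'obfuscatedword0', 'bar': 'obfuscatedword1'})
# both keywords from the counterexample are now obfuscated
print(obfuscate_line('foo bar foo bar qux', m))
```

Without the use_token_lookup guard, a stub whose compiled_search never matches would break out after replacing only 'foo', reproducing the partial obfuscation shown in the counterexample.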
    if not self.mapping.compiled_search.search(line):
        break
    for item, reg in self.mapping.get_matched_items(line):
        if reg.search(line):
Just a thought: for use_token_lookup=True, we know all (item, reg) pairs returned will be present in the line, so this search is redundant, isn't it?
Is it safe to have here:
if self.mapping.use_token_lookup or reg.search(line):
?
(Apologies for such iterative feedback; I should have raised this question in previous rounds already.)
Thanks for the suggestion. You're right: for token-lookup maps, get_matched_items() already confirms presence via set intersection and substring matching, so reg.search() is redundant here.
Applied if self.mapping.use_token_lookup or reg.search(line): in both _parse_line_with_compiled_regexes() and parse_string_for_keys().
    if self.compile_regexes:
-       for item, reg in self.mapping.compiled_regexes:
+       for item, reg in self.mapping.get_matched_items(string_data):
            if reg.search(string_data):
Ditto for self.mapping.use_token_lookup or .. sub-condition.
Done, the same pattern is applied here as well.
Nice improvement overall. Leaving one decision to @TurboTurtle plus one improvement in two places to the author; other than that, I will ACK it. As a rough rule of thumb, the performance gain is 5-6% overall (per my benchmark; better results on systems with e.g. many users).
For token-lookup maps (keyword, username), get_matched_items() already guarantees that returned items exist in the line via set intersection and substring matching. The subsequent reg.search() call is therefore redundant and can be skipped.

Guard the reg.search() check with use_token_lookup in both _parse_line_with_compiled_regexes() and parse_string_for_keys() so that token-lookup parsers proceed directly to substitution.

Signed-off-by: Hoyong Lee <igotoweb@gmail.com>
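Applied to parse_string_for_keys(), the guard looks roughly like the sketch below. The function body and the StubMap class are hypothetical simplifications; only the combined `use_token_lookup or reg.search(...)` condition mirrors the change.

```python
import re

def parse_string_for_keys_sketch(string_data, mapping):
    """Sketch: skip the redundant reg.search() for token-lookup maps,
    whose get_matched_items() already guarantees presence."""
    for item, reg in mapping.get_matched_items(string_data):
        if mapping.use_token_lookup or reg.search(string_data):
            string_data = reg.sub(mapping.get(item), string_data)
    return string_data

class StubMap:
    """Hypothetical stand-in exposing the interface used above."""

    def __init__(self, subs, use_token_lookup):
        self.subs = subs
        self.use_token_lookup = use_token_lookup

    def get(self, item):
        return self.subs[item]

    def get_matched_items(self, data):
        return [(i, re.compile(re.escape(i)))
                for i in self.subs if i in data]

m = StubMap({'qux': 'obfuscatedword2'}, use_token_lookup=True)
print(parse_string_for_keys_sketch('foo qux', m))  # foo obfuscatedword2
```

For a traditional map (use_token_lookup=False) the behavior is unchanged: reg.search() still runs before every substitution.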
pmoravec left a comment
No further comments from my side. Thanks for the improvement, and for bearing with the iterative feedback from my side.
Comparing performance on my test suite, I see a 5-10% improvement (it depends on multiple factors: the bigger the sosreport or the default mapping, the bigger the (even relative) improvement). Nice!
Let's wait for a second approval ACK now.
Reopening this PR as the previous PR #4281 was closed during the process of resolving conflicts and recreating the branch.
Relevant: #4227
Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines