
Improve validation performance by caching compiled schematron Templates to avoid repeated XSLT compilation #1572

Open
jordanpadams wants to merge 10 commits into NASA-PDS:main from JPL-Devin:main


@jordanpadams
Member

@jordanpadams jordanpadams commented Apr 1, 2026

🗒️ Summary

Adds a HashMap<String, Templates> cache to SchematronTransformer so the expensive ISO schematron XSLT compilation runs once per unique schematron source string rather than once per label. Subsequent calls to transform(String) return a new Transformer from the pre-compiled Templates object — eliminating the dominant CPU cost during bundle validation for large collections with many products.

Changes in SchematronTransformer.java:

  • Added cachedTemplates instance field (Map<String, Templates>)
  • Extracted XSLT compilation into private compileSchematron() returning Templates
  • Cache lookup/store added to transform(String, ProblemHandler)
  • Added clearCache() for explicit lifecycle management
  • Added debug logging for cache hits/misses
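
As a rough illustration of the pattern in the list above, the sketch below caches `Templates` by source string and hands out a new `Transformer` per call. It is not the actual `SchematronTransformer` code: the class name `TemplatesCacheSketch`, the `compileCount` counter, the `apply` helper, and the tiny stand-in stylesheet are all assumptions made for illustration, and the real class compiles ISO schematron rather than a hand-written XSLT.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of the caching pattern; not the PR's SchematronTransformer.
class TemplatesCacheSketch {
  // Stand-in stylesheet (assumption): writes the root element's name as text.
  static final String XSLT =
      "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\"><xsl:value-of select=\"name(/*)\"/></xsl:template>"
          + "</xsl:stylesheet>";

  private final TransformerFactory factory = TransformerFactory.newInstance();
  private final Map<String, Templates> cachedTemplates = new HashMap<>();
  private int compileCount = 0; // illustration only, to make cache misses visible

  // Returns a fresh Transformer; the stylesheet is compiled only on a cache miss.
  Transformer transform(String source) throws TransformerException {
    Templates templates = cachedTemplates.get(source);
    if (templates == null) {
      templates = factory.newTemplates(new StreamSource(new StringReader(source)));
      cachedTemplates.put(source, templates);
      compileCount++;
    }
    // Transformer instances are not shared between calls; only Templates is cached.
    return templates.newTransformer();
  }

  int compileCount() {
    return compileCount;
  }

  void clearCache() {
    cachedTemplates.clear();
  }

  // Helper (assumption): runs a transformer over an XML string and returns the output.
  static String apply(Transformer transformer, String xml) throws TransformerException {
    StringWriter out = new StringWriter();
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws TransformerException {
    TemplatesCacheSketch sketch = new TemplatesCacheSketch();
    System.out.println(apply(sketch.transform(XSLT), "<label/>"));
    System.out.println(apply(sketch.transform(XSLT), "<label/>"));
    System.out.println("compilations: " + sketch.compileCount());
  }
}
```

In this sketch the second `transform` call with the same source skips `newTemplates()` entirely, which is the step the PR describes as the dominant cost.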

🤖 AI Assistance Disclosure

  • No AI assistance used
  • AI used for light assistance (e.g., suggestions, refactoring, documentation help, minor edits)
  • AI used for moderate content generation (AI generated some code or logic, but the developer authored or heavily revised the majority)
  • AI generated substantial portions of this code

Estimated % of code influenced by AI: 90%

⚙️ Test Data and/or Report

Existing Cucumber integration tests in src/test/resources/features/ cover schematron-based validation. No new test data required — this change is transparent to callers. Performance improvement observable via timing on any bundle validation run.

♻️ Related Issues

Fixes #1565

🤓 Reviewer Checklist

Reviewers: Please verify the following before approving this pull request.

Documentation and PR Content

  • Documentation: README, Wiki, or inline documentation (Sphinx, Javadoc, Docstrings) have been updated to reflect these changes.
  • Issue Traceability: The PR is linked to a valid GitHub Issue
  • PR Title: The PR title is "user-friendly" clearly identifying what is being fixed or the new feature being added, that if you saw it in the Release Notes for a tool, you would be able to get the gist of what was done.

Security & Quality

  • SonarCloud: Confirmed no new High or Critical security findings.
  • Secrets Detection: Verified that the Secrets Detection scan passed and no sensitive information (keys, tokens, PII) is exposed.
  • Code Quality: Code follows organization style guidelines and best practices for the specific language (e.g., PEP 8, Google Java Style).

Testing & Validation

  • Test Accuracy: Verified that test data is accurate, representative of real-world PDS4 scenarios, and sufficient for the logic being tested.
  • Coverage: Automated tests cover new logic and edge cases.
  • Local Verification: (If applicable) Successfully built and ran the changes in a local or staging environment.

Maintenance

  • Backward Compatibility: Confirmed that these changes do not break existing downstream dependencies or API contracts (or that breaking changes are clearly documented).

devin-ai-integration bot and others added 2 commits April 1, 2026 22:40
Add a HashMap<String, Templates> cache to SchematronTransformer so that
the expensive ISO schematron XSLT compilation is performed only once per
unique schematron source string. Subsequent calls to transform(String)
return a new Transformer from the cached Templates object.

- Extract compilation logic into private compileSchematron() returning Templates
- Add cache lookup in transform(String, ProblemHandler)
- Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance)
- Add debug logging for cache hits/misses

Fixes: NASA-PDS#1565
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…n-templates

Cache compiled schematron Templates to avoid repeated XSLT compilation
@jordanpadams jordanpadams requested a review from a team as a code owner April 1, 2026 22:56
@jordanpadams jordanpadams changed the title from "TBD" to "Improve validation performance by caching compiled schematron Templates to avoid repeated XSLT compilation" Apr 1, 2026
Member Author

@jordanpadams jordanpadams left a comment

PR Review

Overall: Approve with minor comments. The implementation is correct and well-targeted. Using javax.xml.transform.Templates is exactly the right JAXP API for this — it represents a compiled, thread-safe stylesheet that produces new Transformer instances cheaply. The core fix is sound.


What's good

  • Correct API choice. Templates is JAXP's purpose-built mechanism for compiled stylesheet reuse. newTransformer() is lightweight; the expensive compilation (newTemplates()) now happens once.
  • Clean extraction. compileSchematron() as a private method reads well and avoids duplication between the cached and uncached code paths.
  • Transformer not cached. Correctly calls templates.newTransformer() on each use — Transformer is not thread-safe and must not be shared.
  • Debug logging. Cache hit/miss logging will make performance profiling straightforward.

Comments

1. HashMap is not thread-safe — use ConcurrentHashMap as a forward-looking safety measure

SchematronTransformer is currently safe because LabelValidator.parseAndValidate() is synchronized, so the cache is only accessed by one thread at a time. But the parallel-validation work (#1566, #1567) will remove that synchronized constraint. Switching to ConcurrentHashMap now costs nothing and avoids a subtle data race regression later:

// current
private final Map<String, Templates> cachedTemplates = new HashMap<>();

// safer
private final Map<String, Templates> cachedTemplates = new ConcurrentHashMap<>();
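
To make the concern concrete, here is a minimal sketch of one way a concurrent cache could behave once callers are no longer serialized. It uses `ConcurrentHashMap.computeIfAbsent`, which applies the mapping function at most once per key. The class `ConcurrentCacheSketch`, the string stand-in for a compiled `Templates`, and the `compilesUnderContention` helper are hypothetical and only illustrate the idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a String stands in for the compiled Templates object.
class ConcurrentCacheSketch {
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final AtomicInteger compileCount = new AtomicInteger();

  // Stand-in for the expensive compilation step.
  private String compile(String source) {
    compileCount.incrementAndGet();
    return "compiled:" + source.length();
  }

  // computeIfAbsent runs compile() at most once per key, even under contention.
  String lookup(String source) {
    return cache.computeIfAbsent(source, this::compile);
  }

  int compileCount() {
    return compileCount.get();
  }

  // Starts several threads that all request the same key at roughly the same time
  // and returns how many compilations actually ran.
  static int compilesUnderContention(int threads) throws InterruptedException {
    ConcurrentCacheSketch sketch = new ConcurrentCacheSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        sketch.lookup("<schema/>");
      });
    }
    start.countDown();
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return sketch.compileCount();
  }
}
```

One trade-off to keep in mind: `computeIfAbsent` can block other updates to the map while the function runs, and XSLT compilation is slow. A plain `get` followed by `put` avoids that blocking at the cost of an occasional duplicate compilation, which is harmless if the compiled results are interchangeable.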

2. Cache key is the full schematron string — memory overhead for large schematrons

Using the source string itself as the key means the map keeps the entire schematron text reachable in addition to the compiled Templates. For large schematrons this adds roughly the size of the source text to every cached entry. A minor improvement would be keying on a hash of the source (e.g., SHA-256 hex) rather than the full string. Low priority but worth knowing.
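
A minimal sketch of such a key, assuming Java 17+ for `java.util.HexFormat` and UTF-8 encoding of the source text. The class and method names (`CacheKeySketch`, `cacheKey`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper: derives a fixed-size cache key from the schematron source.
class CacheKeySketch {
  // Returns the SHA-256 digest of the source as a 64-character lowercase hex string.
  static String cacheKey(String schematronSource) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(schematronSource.getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(hash);
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException("SHA-256 unavailable", e);
    }
  }
}
```

The key is 64 characters regardless of schematron size, so the map no longer pins the source text. The cost is one digest computation over the full source on every lookup, including cache hits.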

3. clearCache() is public but not wired up anywhere — consider package-private or remove

The commit message says the cache "naturally resets when LabelValidator.clear() creates a new instance." If that's the lifecycle, clearCache() appears to be dead code. Either wire it to LabelValidator.clear() or make it package-private to limit surface area. If it's intended for future use, a // for testing comment would clarify intent.

4. transform(Source, ProblemHandler) bypass — cache is not used for all call paths

The Source-based transform(Source, ProblemHandler) path bypasses the cache entirely — it calls compileSchematron() directly every time. If any caller goes through the Source overload rather than the String overload, they get no benefit. Worth confirming that LabelValidator call sites only go through the String path in practice.

5. No test for cache behavior

There's no test asserting that compileSchematron is called only once for repeated identical inputs. A simple unit test on SchematronTransformer using a spy or call counter would protect this from regression.
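
One possible shape for the call-counter approach, written as plain Java so it stands alone. A real test would more likely live in JUnit and exercise `SchematronTransformer` directly. Everything here (`CountingCacheSketch`, the injected compiler function, `runCacheBehaviorChecks`) is hypothetical and only shows the assertions worth making.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache with an injected "compiler" so that calls can be counted.
class CountingCacheSketch<V> {
  private final Map<String, V> cache = new ConcurrentHashMap<>();
  private final Function<String, V> compiler;
  private final AtomicInteger compileCalls = new AtomicInteger();

  CountingCacheSketch(Function<String, V> compiler) {
    this.compiler = compiler;
  }

  // Looks up the compiled value, compiling and storing it on a miss.
  V get(String source) {
    V cached = cache.get(source);
    if (cached == null) {
      compileCalls.incrementAndGet();
      cached = compiler.apply(source);
      cache.put(source, cached);
    }
    return cached;
  }

  int compileCalls() {
    return compileCalls.get();
  }

  int cacheSize() {
    return cache.size();
  }

  void clearCache() {
    cache.clear();
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  // The assertions a regression test for the cache might make.
  static void runCacheBehaviorChecks() {
    CountingCacheSketch<String> cache = new CountingCacheSketch<>(s -> "compiled:" + s);

    cache.get("schematron-A");
    cache.get("schematron-A");
    check(cache.compileCalls() == 1, "identical input should compile once");

    cache.get("schematron-B");
    check(cache.compileCalls() == 2, "a new input should trigger a compile");
    check(cache.cacheSize() == 2, "both inputs should be cached");

    cache.clearCache();
    cache.get("schematron-A");
    check(cache.compileCalls() == 3, "clearing the cache should force a recompile");
  }
}
```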


Summary

| Comment | Severity |
| --- | --- |
| Use ConcurrentHashMap for future thread-safety | Recommended |
| Cache key memory overhead | Low / informational |
| clearCache() wiring / visibility | Minor |
| Source-path cache bypass | Minor (verify not hit in practice) |
| No cache-hit regression test | Recommended |

jordanpadams and others added 4 commits April 1, 2026 23:06
Intercept non-existent file targets in doValidation() before running the
validator. Records a MISSING_REFERENCED_FILE error (PRODUCT category) directly
to the report so the product shows FAIL, the error is counted in the summary,
and the exit code is non-zero.

Previously, LocationValidator recorded the error as NO_PRODUCTS_FOUND
(ProblemCategory.EXECUTION), which Report.record() explicitly excluded from
error counts, causing the product to show PASS with 0 errors in the summary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a HashMap<String, Templates> cache to SchematronTransformer so that
the expensive ISO schematron XSLT compilation is performed only once per
unique schematron source string. Subsequent calls to transform(String)
return a new Transformer from the cached Templates object.

- Extract compilation logic into private compileSchematron() returning Templates
- Add cache lookup in transform(String, ProblemHandler)
- Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance)
- Add debug logging for cache hits/misses

Fixes: NASA-PDS#1565
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. Use ConcurrentHashMap instead of HashMap for thread-safety (NASA-PDS#1566, NASA-PDS#1567)
2. Key cache on SHA-256 hash of source string to reduce memory overhead
3. Make clearCache() package-private; add cacheSize() for testing
4. Document Source-path cache bypass design decision
5. Add SchematronTransformerTest with cache behavior assertions

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
devin-ai-integration bot and others added 4 commits April 1, 2026 23:29
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. ThreadLocal<MessageDigest> instead of per-call getInstance (avoids provider lookup)
2. HexFormat.of().formatHex() instead of String.format loop (Java 17+)
3. Document intentional check-then-act race on ConcurrentHashMap
4. @Tag("integration") and class-level Javadoc noting Saxon-HE dependency on tests

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…ew-comments

Address PR review comments on schematron Templates cache
@al-niessner
Contributor

@jordanpadams

Be careful with this one. I found that the schematron libraries that we use have hidden globals that cause things to be remembered. Of course that was 20 versions ago...

@jordanpadams
Member Author

@jordanpadams

Be careful with this one. I found that the schematron libraries that we use have hidden globals that cause things to be remembered. Of course that was 20 versions ago...

@al-niessner copy that. I am going to squash these updates to ensure we can easily revert if we determine it is introducing bugs we can't fix.


3 participants