Improve validation performance by caching compiled schematron Templates to avoid repeated XSLT compilation#1572
Improve validation performance by caching compiled schematron Templates to avoid repeated XSLT compilation#1572jordanpadams wants to merge 10 commits intoNASA-PDS:mainfrom
Conversation
Add a HashMap<String, Templates> cache to SchematronTransformer so that the expensive ISO schematron XSLT compilation is performed only once per unique schematron source string. Subsequent calls to transform(String) return a new Transformer from the cached Templates object. - Extract compilation logic into private compileSchematron() returning Templates - Add cache lookup in transform(String, ProblemHandler) - Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance) - Add debug logging for cache hits/misses Fixes: NASA-PDS#1565 Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…n-templates Cache compiled schematron Templates to avoid repeated XSLT compilation
jordanpadams
left a comment
There was a problem hiding this comment.
PR Review
Overall: Approve with minor comments. The implementation is correct and well-targeted. Using javax.xml.transform.Templates is exactly the right JAXP API for this — it represents a compiled, thread-safe stylesheet that produces new Transformer instances cheaply. The core fix is sound.
What's good
- Correct API choice.
Templatesis JAXP's purpose-built mechanism for compiled stylesheet reuse.newTransformer()is lightweight; the expensive compilation (newTemplates()) now happens once. - Clean extraction.
compileSchematron()as a private method reads well and avoids duplication between the cached and uncached code paths. Transformernot cached. Correctly callstemplates.newTransformer()on each use —Transformeris not thread-safe and must not be shared.- Debug logging. Cache hit/miss logging will make performance profiling straightforward.
Comments
1. HashMap is not thread-safe — use ConcurrentHashMap as a forward-looking safety measure
SchematronTransformer is currently safe because LabelValidator.parseAndValidate() is synchronized, so the cache is only accessed by one thread at a time. But the parallel-validation work (#1566, #1567) will remove that synchronized constraint. Switching to ConcurrentHashMap now costs nothing and avoids a subtle data race regression later:
// current
private final Map<String, Templates> cachedTemplates = new HashMap<>();
// safer
private final Map<String, Templates> cachedTemplates = new ConcurrentHashMap<>();2. Cache key is the full schematron string — memory overhead for large schematrons
Using the source string itself as the key means the map holds a reference to the entire schematron text in addition to the compiled Templates. For large schematrons this doubles memory use for cached entries. A minor improvement would be keying on a hash of the source (e.g., SHA-256 hex) rather than the full string. Low priority but worth knowing.
3. clearCache() is public but not wired up anywhere — consider package-private or remove
The commit message says the cache "naturally resets when LabelValidator.clear() creates a new instance." If that's the lifecycle, clearCache() appears to be dead code. Either wire it to LabelValidator.clear() or make it package-private to limit surface area. If it's intended for future use, a // for testing comment would clarify intent.
4. transform(Source, ProblemHandler) bypass — cache is not used for all call paths
The Source-based transform(Source, ProblemHandler) path bypasses the cache entirely — it calls compileSchematron() directly every time. If any caller goes through the Source overload rather than the String overload, they get no benefit. Worth confirming that LabelValidator call sites only go through the String path in practice.
5. No test for cache behavior
There's no test asserting that compileSchematron is called only once for repeated identical inputs. A simple unit test on SchematronTransformer using a spy or call counter would protect this from regression.
Summary
| Comment | Severity |
|---|---|
Use ConcurrentHashMap for future thread-safety |
Recommended |
| Cache key memory overhead | Low / informational |
clearCache() wiring / visibility |
Minor |
Source-path cache bypass |
Minor — verify not hit in practice |
| No cache-hit regression test | Recommended |
Intercept non-existent file targets in doValidation() before running the validator. Records a MISSING_REFERENCED_FILE error (PRODUCT category) directly to the report so the product shows FAIL, the error is counted in the summary, and the exit code is non-zero. Previously, LocationValidator recorded the error as NO_PRODUCTS_FOUND (ProblemCategory.EXECUTION), which Report.record() explicitly excluded from error counts, causing the product to show PASS with 0 errors in the summary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a HashMap<String, Templates> cache to SchematronTransformer so that the expensive ISO schematron XSLT compilation is performed only once per unique schematron source string. Subsequent calls to transform(String) return a new Transformer from the cached Templates object. - Extract compilation logic into private compileSchematron() returning Templates - Add cache lookup in transform(String, ProblemHandler) - Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance) - Add debug logging for cache hits/misses Fixes: NASA-PDS#1565 Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. Use ConcurrentHashMap instead of HashMap for thread-safety (NASA-PDS#1566, NASA-PDS#1567) 2. Key cache on SHA-256 hash of source string to reduce memory overhead 3. Make clearCache() package-private; add cacheSize() for testing 4. Document Source-path cache bypass design decision 5. Add SchematronTransformerTest with cache behavior assertions Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. ThreadLocal<MessageDigest> instead of per-call getInstance (avoids provider lookup) 2. HexFormat.of().formatHex() instead of String.format loop (Java 17+) 3. Document intentional check-then-act race on ConcurrentHashMap 4. @tag("integration") and class-level Javadoc noting Saxon-HE dependency on tests Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…ew-comments Address PR review comments on schematron Templates cache
|
Becareful with this one. I found that the schematron libraries that we use have hidden globals that cause things to be remembered. Of course that was 20 versions ago... |
@al-niessner copy that. I am going to squash these updates to ensure we can easily revert if we determine it is introducing bugs we can't fix. |
🗒️ Summary
Adds a
HashMap<String, Templates>cache toSchematronTransformerso the expensive ISO schematron XSLT compilation runs once per unique schematron source string rather than once per label. Subsequent calls totransform(String)return a newTransformerfrom the pre-compiledTemplatesobject — eliminating the dominant CPU cost during bundle validation for large collections with many products.Changes in
SchematronTransformer.java:cachedTemplatesinstance field (Map<String, Templates>)compileSchematron()returningTemplatestransform(String, ProblemHandler)clearCache()for explicit lifecycle management🤖 AI Assistance Disclosure
Estimated % of code influenced by AI: 90%
⚙️ Test Data and/or Report
Existing Cucumber integration tests in
src/test/resources/features/cover schematron-based validation. No new test data required — this change is transparent to callers. Performance improvement observable via timing on any bundle validation run.♻️ Related Issues
Fixes #1565
🤓 Reviewer Checklist
Reviewers: Please verify the following before approving this pull request.
Documentation and PR Content
Security & Quality
Testing & Validation
Maintenance