refactor: remove synchronized bottleneck from LabelValidator using ThreadLocal #3

Open
devin-ai-integration[bot] wants to merge 5 commits into main from devin/1775083777-threadlocal-label-validator
@devin-ai-integration (bot) commented Apr 1, 2026

Summary

Removes the synchronized keyword from parseAndValidate() and all validate() overloads in LabelValidator, which was serializing all label validation in multi-threaded runs because LabelValidator is used as a singleton via ValidationResourceManager.

The per-call mutable parsing state (cachedParser, cachedValidatorHandler, docBuilder, cachedSchematron, cachedLSResolver, validatingSchema, plus the non-thread-safe SAXParserFactory and SchemaFactory) is moved into a private static inner class ParserState, held in a ThreadLocal<ParserState>. Each thread now gets its own isolated parser instances.
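The per-thread holder described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ThreadLocalParserSketch`, its single `docBuilder` field, and the `builderFromOtherThread` helper are hypothetical stand-ins for the real `ParserState` contents.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class ThreadLocalParserSketch {
    /** Hypothetical stand-in for ParserState: one set of parser objects per thread. */
    private static final class ParserState {
        final DocumentBuilder docBuilder;

        ParserState() {
            try {
                // The factory is not guaranteed to be thread-safe, so each thread builds its own.
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                docBuilder = factory.newDocumentBuilder();
            } catch (ParserConfigurationException e) {
                // The Supplier passed to withInitial cannot throw checked exceptions.
                throw new RuntimeException("Failed to initialise per-thread ParserState", e);
            }
        }
    }

    private final ThreadLocal<ParserState> threadState = ThreadLocal.withInitial(ParserState::new);

    /** Returns the calling thread's own DocumentBuilder; no lock is taken. */
    DocumentBuilder builderForCurrentThread() {
        return threadState.get().docBuilder;
    }

    /** Demonstration helper: returns the builder seen by a freshly started thread. */
    static DocumentBuilder builderFromOtherThread(ThreadLocalParserSketch sketch) {
        DocumentBuilder[] holder = new DocumentBuilder[1];
        Thread worker = new Thread(() -> holder[0] = sketch.builderForCurrentThread());
        worker.start();
        try {
            worker.join(); // join() makes the worker's write to holder visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }
}
```

Repeated calls on one thread reuse the same builder, while a second thread lazily constructs its own on first access.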

filesProcessed is changed to AtomicLong and totalTimeElapsed to DoubleAdder for lock-free concurrent accumulation.
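A small sketch of the lock-free counters, assuming a hypothetical `record()` method that each validation call invokes once; the field names follow the PR, everything else is illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.DoubleAdder;

class ValidationStatsSketch {
    private final AtomicLong filesProcessed = new AtomicLong();
    private final DoubleAdder totalTimeElapsed = new DoubleAdder();

    /** Called once per validated label; safe to call from any thread without locking. */
    void record(double elapsedSeconds) {
        filesProcessed.incrementAndGet();
        totalTimeElapsed.add(elapsedSeconds);
    }

    long filesProcessed() {
        return filesProcessed.get();
    }

    double totalTimeElapsed() {
        return totalTimeElapsed.sum();
    }

    /** Demonstration helper: several threads each record perThread half-second samples. */
    static ValidationStatsSketch runDemo(int threads, int perThread) {
        ValidationStatsSketch stats = new ValidationStatsSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    stats.record(0.5);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stats;
    }
}
```

`DoubleAdder.sum()` is not an atomic snapshot while updates are in flight, so the totals are best read after the worker threads have finished.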

Resolves NASA-PDS#1566

Updates since last revision

Round 2 — Six concurrency bugs from initial review

  1. cachedLabelSchematrons data race — Changed from HashMap to ConcurrentHashMap; replaced containsKey + put with computeIfAbsent(). The field is now volatile so that clear()'s replacement is visible to all threads.
  2. resolver.setProblemHandler() race: CachedLSResourceResolver.handler is now a ThreadLocal<ProblemHandler> so concurrent callers don't overwrite each other's handler.
  3. setCachedLSResourceResolver() thread scoping — Reverted to a shared volatile instance field (sharedCachedLSResolver) that is propagated to each thread's ParserState during createParserIfNeeded.
  4. clear() only resetting one thread — Added an AtomicLong configGeneration counter. clear() increments it; each ParserState stores the generation it was created for. parseAndValidate() checks for a mismatch and discards stale state.
  5. skipProductValidation not initialized in clear() — Added skipProductValidation = false in clear().
  6. Thread-safety documentation — Added class-level Javadoc documenting the thread-safety contract (setup-time vs. validation-time methods, generation counter lifecycle). Noted that LabelUtil.setLocation() and registerIMVersion() are internally synchronized.
  7. Concurrent integration test — Added LabelValidatorConcurrencyTest with two tests: concurrentParseAndValidate_sameLabel (4 threads × 3 iterations) and clearInvalidatesAllThreads (verifies generation counter). Uses local schemas via XML catalog rewriting to avoid network I/O.
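The generation-counter idea from item 4 can be sketched as below. The names `configGeneration`, `ParserState`, and `clear()` come from the description above; `currentState()` and `stateIdentity()` are hypothetical helpers, and the real class may structure the check differently.

```java
import java.util.concurrent.atomic.AtomicLong;

class GenerationCounterSketch {
    /** Per-thread state tagged with the configuration generation it was built for. */
    private static final class ParserState {
        final long generation;

        ParserState(long generation) {
            this.generation = generation;
        }
    }

    private final AtomicLong configGeneration = new AtomicLong();
    private final ThreadLocal<ParserState> threadState = new ThreadLocal<>();

    /** Invalidates every thread's state without touching other threads' ThreadLocal slots. */
    void clear() {
        configGeneration.incrementAndGet();
    }

    /** Returns the caller's state, rebuilding it if clear() ran since it was created. */
    private ParserState currentState() {
        long current = configGeneration.get();
        ParserState ps = threadState.get();
        if (ps == null || ps.generation != current) {
            ps = new ParserState(current); // stale or missing: discard and rebuild
            threadState.set(ps);
        }
        return ps;
    }

    /** Demonstration helper: identity of the state object the caller would parse with. */
    Object stateIdentity() {
        return currentState();
    }
}
```

Each thread notices the mismatch lazily on its next call, so `clear()` never has to reach into another thread's ThreadLocal slot.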

Round 3 — Three remaining issues from re-review

  1. CachedLSResourceResolver.cachedEntities data race — Changed from HashMap to ConcurrentHashMap; replaced containsKey/put with get() + putIfAbsent() pattern. Multiple threads may redundantly fetch the same resource, but only one wins the putIfAbsent; no data corruption.
  2. Dead catch (TransformerException) block — Removed unreachable catch in loadLabelSchematrons. After the computeIfAbsent lambda wraps checked exceptions in RuntimeException, the original catch (TransformerException te) was dead code. The catch (RuntimeException re) block now unwraps TransformerException from the cause and handles other runtime exceptions with a descriptive message.
  3. Narrow concurrent test coverage — Added concurrentParseAndValidate_withSchematron() test (4 threads × 2 iterations) that enables label-schematron validation, exercising loadLabelSchematrons / cachedLabelSchematrons concurrent computeIfAbsent and the shared CachedLSResourceResolver under concurrent access.
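A rough sketch of the get() + putIfAbsent() pattern from item 1, assuming a hypothetical `fetcher` function that returns null on failure; the entity type is simplified to `byte[]`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class EntityCacheSketch {
    private final ConcurrentMap<String, byte[]> cachedEntities = new ConcurrentHashMap<>();
    private final Function<String, byte[]> fetcher; // hypothetical loader; may return null

    EntityCacheSketch(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached entity for systemId, fetching it on a miss. */
    byte[] resolve(String systemId) {
        byte[] cached = cachedEntities.get(systemId);
        if (cached != null) {
            return cached;
        }
        byte[] fetched = fetcher.apply(systemId); // several threads may fetch redundantly
        if (fetched == null) {
            return null; // failed fetch: nothing is cached
        }
        byte[] winner = cachedEntities.putIfAbsent(systemId, fetched);
        return winner != null ? winner : fetched; // everyone ends up with the first stored value
    }

    int size() {
        return cachedEntities.size();
    }
}
```

Compared with computeIfAbsent(), this keeps the potentially slow fetch outside the map's internal locking, at the cost of occasional duplicate fetches.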

Review & Testing Checklist for Human

  • computeIfAbsent null handling in loadLabelSchematrons. ConcurrentHashMap.computeIfAbsent() records no mapping when the mapping function returns null and hands that null back to the caller. If schematronTransformer.fetch() can ever return null, nothing is cached for that key and loadLabelSchematrons receives null, which may surface later as a NullPointerException. Verify that fetch() always returns a non-null String on success, or add a null guard.
  • RuntimeException unwrapping in loadLabelSchematrons. The catch (RuntimeException re) block inspects re.getCause() to distinguish TransformerException from other failures. If a genuine (non-wrapping) RuntimeException is thrown by computeIfAbsent internals or by fetch() itself, the cause may be null, which the code handles but may produce a less informative error message. Worth a quick read of the catch block logic.
  • ParserState constructor failure surface. ParserConfigurationException is wrapped in an unchecked RuntimeException since ThreadLocal.withInitial cannot throw checked exceptions. Verify this is acceptable for your error-handling strategy.
  • Concurrent tests skip silently if test resources are missing. findTestLabel() returns null when github71/ELE_MOM.xml is absent, and tests early-return without assertion failure. In CI this means the tests pass vacuously if the resource path changes. Consider whether Assumptions.assumeTrue() or fail() is more appropriate.
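When checking the first item, it may help to probe the map's null handling directly. The sketch below uses only the JDK's `ConcurrentHashMap`; the class and method names are illustrative. As documented in the JDK Javadoc, a null result from the mapping function leaves the key unmapped, whereas passing a null value to putIfAbsent() is rejected.

```java
import java.util.concurrent.ConcurrentHashMap;

class NullMappingSketch {
    /** True if a null-returning mapping function leaves the key unmapped and returns null. */
    static boolean computeIfAbsentSkipsNull() {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        String result = map.computeIfAbsent("k", key -> null);
        return result == null && !map.containsKey("k");
    }

    /** True if putIfAbsent rejects a null value with NullPointerException. */
    static boolean putIfAbsentRejectsNull() {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        try {
            map.putIfAbsent("k", null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }
}
```

Either way, a null guard around the fetch result keeps the failure path explicit.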

Suggested manual test plan: Run a real multi-threaded validation (e.g., validate -t /path/to/bundle/ --threads 4) against a bundle with schematron-validated labels and compare results (error/warning counts) against a single-threaded run to confirm identical output.

Notes

  • The 38 pre-existing cucumber failures in pre.3.6.x.feature are unrelated to this change (count mismatches present on main).
  • SchematronTransformer is still shared across threads but appears stateless per-call (creates fresh TransformerFactory instances internally), so this should be safe.
  • Shared configuration fields (resolver, useLabelSchema, configurations, etc.) are read without synchronization in parseAndValidate. The class-level Javadoc now documents the contract that these are setup-time only.

Link to Devin session: https://nasa-jpl-demo.devinenterprise.com/sessions/27f58bf067b14671aade159c64df6ab0
Requested by: @jordanpadams


…readLocal

Wrap per-thread mutable parsing state (cachedParser, cachedValidatorHandler,
docBuilder, cachedSchematron, cachedLSResolver, validatingSchema,
saxParserFactory, schemaFactory) in a ThreadLocal<ParserState> holder so each
thread gets its own instances. This allows removing synchronized from
parseAndValidate(), validate(ProblemHandler, URL), validate(ProblemHandler, File),
and validate(ProblemHandler, URL, String).

Also change filesProcessed to AtomicLong and totalTimeElapsed to DoubleAdder
for lock-free concurrent updates, and update clear() to call
threadState.remove().

Resolves NASA-PDS#1566

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>

@jordanpadams left a comment

Review: Remove synchronized Bottleneck from LabelValidator via ThreadLocal

The approach is sound: moving per-thread mutable state into ThreadLocal<ParserState> is the right pattern for this kind of singleton parser. However, there are real concurrency bugs that must be fixed before this can safely land.

Bug (Must Fix): cachedLabelSchematrons HashMap Is Now a Data Race

loadLabelSchematrons() does containsKey + put on the shared cachedLabelSchematrons (a plain HashMap) from inside parseAndValidate(), which is no longer synchronized. When useLabelSchematron is true and multiple threads validate concurrently, this is an unsynchronized read-write to a shared mutable HashMap — a classic data race that can cause ConcurrentModificationException or silent corruption.

Fix: change cachedLabelSchematrons to a ConcurrentHashMap and replace the containsKey + put pattern with computeIfAbsent().
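A minimal sketch of that fix, assuming a hypothetical `load()` method standing in for the schematron compilation step and a plain `String` as the cached value:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SchematronCacheSketch {
    private final Map<String, String> cachedLabelSchematrons = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    /** Hypothetical stand-in for compiling a schematron; counts how often it runs. */
    private String load(String url) {
        loads.incrementAndGet();
        return "compiled:" + url;
    }

    /** Atomic check-then-insert: replaces the racy containsKey + put sequence. */
    String get(String url) {
        return cachedLabelSchematrons.computeIfAbsent(url, this::load);
    }

    int loadCount() {
        return loads.get();
    }

    /** Demonstration helper: many threads request the same key at once. */
    static SchematronCacheSketch runDemo(int threads) {
        SchematronCacheSketch cache = new SchematronCacheSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> cache.get("http://example.invalid/rules.sch"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return cache;
    }
}
```

`ConcurrentHashMap.computeIfAbsent()` applies the mapping function at most once per key, so the load runs a single time even when several threads miss simultaneously.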

Bug (Must Fix): setCachedLSResourceResolver() Now Scoped to Calling Thread Only

public void setCachedLSResourceResolver(CachedLSResourceResolver resolver) {
    this.threadState.get().cachedLSResolver = resolver;  // only sets THIS thread's state
}

Any caller that sets a custom resolver and then triggers validation from a different thread will have its resolver silently ignored. The other threads will use whatever resolver their ParserState was initialized with. This is a silent behavior change from the previous global assignment. Either document clearly that this method only affects the calling thread, or find a way to propagate it to all threads (e.g., store as a shared instance field that overrides the per-thread default in createParserIfNeeded).
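One possible shape for the shared-field option, sketched under assumptions: `Resolver` is a hypothetical one-method interface standing in for `CachedLSResourceResolver`, and `resolverNameOnNewThread()` exists only to demonstrate the propagation.

```java
class SharedResolverSketch {
    /** Hypothetical stand-in for CachedLSResourceResolver. */
    interface Resolver {
        String name();
    }

    private static final class ParserState {
        Resolver cachedLSResolver = () -> "default";
    }

    // Written during setup, read by every validation thread.
    private volatile Resolver sharedCachedLSResolver;
    private final ThreadLocal<ParserState> threadState = ThreadLocal.withInitial(ParserState::new);

    /** Global setter: affects all threads, not just the caller. */
    void setCachedLSResourceResolver(Resolver resolver) {
        sharedCachedLSResolver = resolver;
    }

    /** Copies the shared override, if any, into the calling thread's state. */
    private ParserState createParserIfNeeded() {
        ParserState ps = threadState.get();
        Resolver shared = sharedCachedLSResolver; // single volatile read
        if (shared != null) {
            ps.cachedLSResolver = shared;
        }
        return ps;
    }

    /** Demonstration helper: which resolver does a brand-new thread end up using? */
    String resolverNameOnNewThread() {
        String[] out = new String[1];
        Thread worker = new Thread(() -> out[0] = createParserIfNeeded().cachedLSResolver.name());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out[0];
    }
}
```

A resolver set on one thread is then picked up by any other thread the next time it prepares its parser.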

Concern: Memory Leak in Long-Lived Thread Pools

threadState is an instance field, not static. When a LabelValidator is discarded but the threads that used it are kept alive (e.g., in an executor thread pool), those threads' ThreadLocal entries hold ParserState objects that will not be collected until the thread dies or threadState.remove() is called on that thread. Since clear() only removes the calling thread's state, threads that performed validation but never called clear() will leak ParserState for the lifetime of the thread.

This is especially relevant if LabelValidator instances are created and discarded per-validation-run while threads are reused. Consider adding a try-finally in parseAndValidate() that calls threadState.remove() after each label is processed, or document the lifecycle contract.
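The try-finally option could look roughly like this. It is a sketch with a `StringBuilder` standing in for the real `ParserState`, not a recommendation for this codebase.

```java
class PerCallCleanupSketch {
    // StringBuilder stands in for the real per-thread parser state.
    private final ThreadLocal<StringBuilder> threadState =
            ThreadLocal.withInitial(StringBuilder::new);

    /** Uses the per-thread scratch state, then always releases it. */
    String parseAndValidate(String label) {
        try {
            StringBuilder scratch = threadState.get();
            scratch.append("validated:").append(label);
            return scratch.toString();
        } finally {
            // Drop this thread's entry so a pooled thread does not retain it.
            threadState.remove();
        }
    }
}
```

The trade-off is that removing the state after every label also discards the cached parser, so each call pays the construction cost again; documenting the lifecycle contract avoids that cost.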

Concern: Shared Configuration Fields Read Without Synchronization

Fields like resolver, useLabelSchema, skipProductValidation, userSchematronTransformers, configurations, userSchemaFiles, userSchematronFiles, and cachedEntityResolver are all read inside parseAndValidate() without any synchronization. The PR assumes they are only written during single-threaded setup. This assumption must be verified and documented — if any caller writes to these fields while validation is running, the results are undefined.

Concern: LabelUtil Static Methods Called from Unsynchronized Context

LabelUtil.setLocation() and LabelUtil.registerIMVersion() are static methods called from the now-unsynchronized parseAndValidate(). Per the CLAUDE.md for this project, LabelUtil maintains static state. If these methods are not internally thread-safe, concurrent validation will corrupt that state. Please verify.

Minor: ParserState Constructor Wraps ParserConfigurationException in RuntimeException

} catch (ParserConfigurationException e) {
    throw new RuntimeException("Failed to initialise per-thread ParserState", e);
}

The old code threw ParserConfigurationException (checked), which callers could handle. This is now a RuntimeException that will crash the validation thread without an informative diagnostic if parser initialization fails. This is an acceptable tradeoff for ThreadLocal.withInitial, but it should be noted in the Javadoc.

Missing: Concurrent Integration Test

The existing test suite runs sequentially. The entire value of this change is multi-threaded throughput. Before relying on this in production, add a test that validates multiple labels from a thread pool concurrently and verifies both correctness and the absence of ConcurrentModificationException. Without this, the cachedLabelSchematrons race will not be caught by CI.
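A framework-free sketch of the kind of harness such a test might use. `runConcurrently` is a hypothetical helper; a real test would pass a task that calls the validator and would assert on the returned problem counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ConcurrentHarnessSketch {
    /**
     * Submits threads * iterations copies of the task to a fixed pool and collects
     * every result. Any exception thrown by a task fails the whole run.
     */
    static <T> List<T> runConcurrently(int threads, int iterations, Callable<T> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < threads * iterations; i++) {
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                // get() rethrows task failures wrapped in ExecutionException
                results.add(future.get(30, TimeUnit.SECONDS));
            }
            return results;
        } catch (Exception e) {
            throw new AssertionError("concurrent run failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Comparing every result against a single-threaded baseline then checks correctness, while the exception path catches failures such as ConcurrentModificationException.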

Summary

The ThreadLocal refactor is the right direction, but the cachedLabelSchematrons race and setCachedLSResourceResolver behavior change are real bugs that will cause failures in production use. Please address those two before merge.

- Change cachedLabelSchematrons to ConcurrentHashMap + computeIfAbsent()
- Make CachedLSResourceResolver.handler ThreadLocal to fix setProblemHandler race
- Add sharedCachedLSResolver field with propagation in createParserIfNeeded
- Add AtomicLong generation counter so clear() invalidates all threads' ParserState
- Initialize skipProductValidation in clear()
- Add comprehensive thread-safety Javadoc (class-level + setCachedLSResourceResolver)
- Document that LabelUtil static methods are internally synchronized
- Add LabelValidatorConcurrencyTest with local schemas (no network I/O)

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@jordanpadams left a comment

Re-Review: Remove synchronized Bottleneck from LabelValidator via ThreadLocal

All six concurrency bugs from the initial review have been addressed. The generation counter approach for clear() is particularly well-designed. The concurrent test is a solid addition. A few remaining items below.

Remaining Bug: CachedLSResourceResolver.cachedEntities Is Still a Plain HashMap

This is explicitly flagged in the PR's own checklist, and it's now more urgent than before. With sharedCachedLSResolver in place, a single CachedLSResourceResolver instance is now explicitly shared across threads via ps.cachedLSResolver = sharedCachedLSResolver. When multiple threads call resolveResource() concurrently on a shared instance, they hit unsynchronized containsKey + put on the internal cachedEntities HashMap — a real data race:

// CachedLSResourceResolver.resolveResource() — called concurrently on shared instance:
if (!cachedEntities.containsKey(systemId)) {  // ← unsynchronized read
    ...
    cachedEntities.put(systemId, entity);      // ← unsynchronized write
}

This was pre-existing but was previously hidden behind synchronized on parseAndValidate. Now it is exposed. Change cachedEntities to ConcurrentHashMap and merge the containsKey + put into computeIfAbsent.

Minor Bug: Dead Code in loadLabelSchematrons Catch Block

After the computeIfAbsent refactor, TransformerException is always wrapped in a RuntimeException inside the lambda. The outer catch (TransformerException te) block is therefore unreachable — all TransformerException instances now arrive as RuntimeException via the RuntimeException re branch. This is dead code and should be removed to avoid confusion.

} catch (RuntimeException re) {
    // Unwrap exceptions from computeIfAbsent lambda
    ...
} catch (TransformerException te) {   // ← dead, never reached
    ...
}
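The wrap-then-unwrap flow can be sketched as follows. `fetch()` and the returned message strings are hypothetical; only `TransformerException` and `ConcurrentHashMap` are real JDK types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.TransformerException;

class UnwrapSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Hypothetical fetch step that may fail with a checked exception. */
    private String fetch(String url) throws TransformerException {
        if (url.startsWith("bad:")) {
            throw new TransformerException("cannot compile " + url);
        }
        return "compiled:" + url;
    }

    String load(String url) {
        try {
            return cache.computeIfAbsent(url, u -> {
                try {
                    return fetch(u);
                } catch (TransformerException e) {
                    // The lambda cannot throw checked exceptions, so wrap it.
                    throw new RuntimeException(e);
                }
            });
        } catch (RuntimeException re) {
            // A single catch handles both cases; the cause tells them apart.
            if (re.getCause() instanceof TransformerException) {
                return "schematron error: " + re.getCause().getMessage();
            }
            return "unexpected error: " + re; // cause may be null here
        }
    }
}
```

Because the exception propagates out of computeIfAbsent(), no mapping is recorded for the failing key and a later call retries the fetch.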

Minor: Concurrent Test Coverage Is Still Narrow

The new LabelValidatorConcurrencyTest runs with schematronCheck=false and skipProductValidation=true, so it exercises none of the schematron loading, loadLabelSchematrons, or CachedLSResourceResolver.resolveResource() code paths under concurrency. For production confidence, a follow-on test with schematronCheck=true against a local schematron file would be valuable — especially once cachedEntities is fixed.

Confirmed Fixed

  • cachedLabelSchematrons → ConcurrentHashMap + computeIfAbsent + volatile
  • CachedLSResourceResolver.handler → ThreadLocal<ProblemHandler>
  • setCachedLSResourceResolver() → volatile sharedCachedLSResolver propagated in createParserIfNeeded
  • clear() one-thread limitation → AtomicLong configGeneration counter with stale-state detection in parseAndValidate
  • skipProductValidation not reset in clear() → fixed ✓
  • Thread-safety contract documented in class-level Javadoc ✓
  • Concurrent integration test added ✓

Summary

Fix CachedLSResourceResolver.cachedEntities (the one remaining real data race), remove the dead catch (TransformerException) block, and this is ready to merge. Good work addressing all the original issues.

devin-ai-integration bot and others added 2 commits April 2, 2026 13:17
- CachedLSResourceResolver.cachedEntities: HashMap -> ConcurrentHashMap + putIfAbsent
- Remove dead catch (TransformerException) block in loadLabelSchematrons (now wrapped in RuntimeException via computeIfAbsent lambda)
- Add concurrentParseAndValidate_withSchematron test exercising schematron loading and CachedLSResourceResolver under concurrent access

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…el-validator' into devin/1775083777-threadlocal-label-validator

Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>

@jordanpadams

Remaining Issue 1: CachedLSResourceResolver.cachedEntities is a plain HashMap accessed concurrently

With sharedCachedLSResolver now explicitly sharing a single CachedLSResourceResolver instance across threads, multiple threads can call resolveResource() concurrently on the same instance. The internal cachedEntities is still a plain HashMap:

// Concurrent threads hit this without synchronization:
if (!cachedEntities.containsKey(systemId)) {   // unsynchronized read
    ...
    cachedEntities.put(systemId, entity);       // unsynchronized write
}

This is a real data race that can cause ConcurrentModificationException or silent cache corruption. Fix by changing cachedEntities to ConcurrentHashMap and replacing the containsKey + put pattern with computeIfAbsent.

Remaining Issue 2: Dead catch (TransformerException te) block in loadLabelSchematrons

After the computeIfAbsent refactor, all checked exceptions thrown inside the lambda (including TransformerException) are wrapped in a RuntimeException. The outer catch (RuntimeException re) block now handles all of them. The subsequent catch (TransformerException te) is therefore unreachable dead code:

} catch (RuntimeException re) {
    // Unwrap exceptions from computeIfAbsent lambda — handles TransformerException too
    ...
} catch (TransformerException te) {   // ← dead, never reached
    ...
}

Remove the dead block to avoid confusion about which path actually handles schematron errors.

@devin-ai-integration

Both issues are already fixed in the latest push (edf28b7d):

  1. cachedEntities data race: Already changed to ConcurrentHashMap (field initializer at line 139 + constructor at line 161). The containsKey+put pattern was replaced with get() + putIfAbsent() (lines 172, 202). I used get()+putIfAbsent() instead of computeIfAbsent() because the fetch logic can fail and return null on error; with get()+putIfAbsent() that null is checked explicitly and never passed to the map, which rejects null values.

  2. Dead catch (TransformerException) block — Already removed in the same commit. The catch (RuntimeException re) block now unwraps TransformerException from re.getCause() and handles it directly.

Also added concurrentParseAndValidate_withSchematron() test that exercises both code paths under concurrent access. CI is green (4/4 checks pass).
