Skip to content

Join table-structure detections with OCR to restore markdown cell con…#1870

Merged
jdye64 merged 5 commits intoNVIDIA:mainfrom
edknv:edwardk/fix-table-structure
Apr 17, 2026
Merged

Join table-structure detections with OCR to restore markdown cell con…#1870
jdye64 merged 5 commits intoNVIDIA:mainfrom
edknv:edwardk/fix-table-structure

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented Apr 17, 2026

…tent

Description

A previous commit correctly removed OCR from the table-structure stage to avoid running it twice, but left use_table_structure=True pipelines producing empty markdown skeletons. The separate OCR stage was also gated off for tables in that config, so no stage produced the per-crop word boxes the existing join_table_structure_and_ocr_output helper needs, and the helper had no caller.

This PR reconnects them so the OCR stage now always runs on table crops. When use_table_structure=True it matches each crop's bbox against the page's regions and calls the existing join helper, falling back to pseudo-markdown on no match.

Before

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_1 page=1 idx=0 =====
|  |  |  |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
============================================================

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_2 page=2 idx=0 =====
|  |  |  |  |
| --- | --- | --- | --- |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
============================================================

After

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_1 page=1 idx=0 =====
| Animal | Activity | Place |
| --- | --- | --- |
| Giraffe | Driving a car | At the beach |
| Lion | Putting on sunscreen | At the park |
| Cat | Jumping onto a laptop | In a home office |
| Dog | Chasing a squirrel | In the front yard |
============================================================

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_2 page=2 idx=0 =====
smoke test to ensure extraction is working as intended. This will be used in Cl time that changes make the library do negatively impact on to ensure we to not
| Car | Color1 | Color2 | Color3 |
| --- | --- | --- | --- |
| Coupe | White | Silver | Flat Gray |
| Sedan | White | Metallic Gray | Matte Gray |
| Minivan | Gray | Beige | Black |
| Truck | Dark Gray | Titanium Gray | Charcoal |
| Convertible | Light Gray | Graphite | Slate Gray |
============================================================

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv requested review from a team as code owners April 17, 2026 00:35
@edknv edknv requested a review from jdye64 April 17, 2026 00:35
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR reconnects the table-structure and OCR stages after a previous commit decoupled them and left use_table_structure=True pipelines producing empty markdown skeletons. The table-structure stage now publishes a table_structure_v1 column containing per-crop detections and crop dimensions; the OCR stage reads this column, matches each table crop by bbox, and calls the existing join_table_structure_and_ocr_output helper to produce populated markdown, with graceful pseudo-markdown fallback on no match. The over-aggressive _trim_non_table_edge_rows heuristic that was discarding legitimate header rows is also removed. Test coverage is solid — local path, remote path, and no-match fallback are all exercised.

Confidence Score: 5/5

Safe to merge — the regression is fully fixed, all three test paths pass, and the one P2 finding degrades gracefully via existing getattr guards.

All remaining findings are P2: the actor call error paths don't include table_structure_v1, but the OCR stage already handles missing columns with a safe fallback. The core logic — new inter-stage column, bbox matching, join/fallback — is correct and well-tested.

gpu_actor.py and cpu_actor.py error fallback paths (minor schema inconsistency, no functional impact)

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/table/shared.py Adds table_structure_v1 inter-stage column containing per-crop detections and crop dimensions; used by OCR stage for bbox-matching. Pass 3 now extracts crop_hw and counts once and shares them between table_items and the new ts_regions list.
nemo_retriever/src/nemo_retriever/ocr/shared.py Adds use_table_structure parameter to ocr_page_elements; introduces _find_ts_detections_for_bbox to match table crop bboxes against table_structure_v1 regions (1e-4 tolerance); joins structure detections with OCR output via join_table_structure_and_ocr_output, falling back to pseudo-markdown on no match.
nemo_retriever/src/nemo_retriever/utils/table_and_chart.py Removes _trim_non_table_edge_rows (confirmed unused elsewhere) and its call-site — was over-aggressively trimming legitimate header/footer rows from extracted tables.
nemo_retriever/src/nemo_retriever/table/gpu_actor.py Removes unused ocr_invoke_url parameter. The __call__ error fallback doesn't include the new table_structure_v1 column, creating schema inconsistency on rare catastrophic failure (functionally safe due to graceful OCR-stage fallback).
nemo_retriever/src/nemo_retriever/table/cpu_actor.py Mirrors gpu_actor.py: removes ocr_invoke_url, same table_structure_v1-absent error path.
nemo_retriever/tests/test_table_structure.py Adds TestOCRJoinsTableStructure with local, no-match fallback, and remote path tests. Updates existing tests to assert table_structure_v1 column shape. Removes now-deleted ocr_invoke_url assertions from config tests.
nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py OCR stage now always runs for table crops regardless of use_table_structure; flag forwarded into ocr_kwargs so the OCR actor can decide whether to join or fall back to pseudo-markdown.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py Matches the multi_type_extract_operator.py change: removes not use_table_structure guard on extract_tables OCR kwarg and forwards use_table_structure into ocr_kwargs.
nemo_retriever/src/nemo_retriever/table/config.py Removes ocr_invoke_url from TableStructureOCRStageConfig and its loader — OCR URL is no longer managed by the table-structure stage.
nemo_retriever/src/nemo_retriever/ocr/gpu_ocr.py Adds use_table_structure kwarg normalization to match the existing pattern for other flags.
nemo_retriever/src/nemo_retriever/ocr/cpu_ocr.py Same use_table_structure kwarg normalization as gpu_ocr.py.

Sequence Diagram

sequenceDiagram
    participant PE as PageElementDetectionActor
    participant TS as TableStructureActor
    participant OCR as OCRActor

    PE->>TS: batch_df (page_elements_v3)
    Note over TS: Detect cell/row/col structure per table crop
    TS->>OCR: batch_df + table (skeleton text) + table_structure_v1 regions

    loop for each table crop
        OCR->>OCR: _find_ts_detections_for_bbox(row, bbox)
        alt structure match found
            OCR->>OCR: join_table_structure_and_ocr_output(ts_dets, preds, crop_hw)
            Note over OCR: Populated markdown table
        else no match or empty detections
            OCR->>OCR: _blocks_to_pseudo_markdown(blocks)
            Note over OCR: Pseudo-markdown fallback
        end
    end
    OCR->>OCR: out[table] = joined results
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/table/gpu_actor.py
Line: 86-87

Comment:
**Error path omits the new `table_structure_v1` column**

Both actor `__call__` error paths (this file and `cpu_actor.py`) were written before this PR added `table_structure_v1`. When the outer `BaseException` catch fires, the returned DataFrame has `table` and `table_structure_ocr_v1` but no `table_structure_v1`. The OCR stage handles missing columns safely via `getattr(row, "table_structure_v1", None)`, so the fallback to pseudo-markdown still works — but the schema is inconsistent with the success path and could confuse downstream consumers or schema-aware Ray Data stages.

The same fix applies to `cpu_actor.py` (same block at lines 85–86):

```python
out["table"] = [[] for _ in range(n)]
out["table_structure_v1"] = [
    {"regions": [], "timing": None, "error": payload["error"]} for _ in range(n)
]
out["table_structure_ocr_v1"] = [payload for _ in range(n)]
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "remove _trim_non_table_edge_rows" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/table/gpu_actor.py Outdated
Comment thread nemo_retriever/tests/test_table_structure.py
Copy link
Copy Markdown
Collaborator

@jioffe502 jioffe502 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I found this in my testing as well.

Thank you for this fix

@jdye64 jdye64 merged commit e97d7cc into NVIDIA:main Apr 17, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants