Join table-structure detections with OCR to restore markdown cell con… by edknv · Pull Request #1870 · NVIDIA/NeMo-Retriever

edknv · 2026-04-17T00:35:26Z

…tent

Description

A previous commit correctly removed OCR from the table-structure stage to avoid running it twice, but left use_table_structure=True pipelines producing empty markdown skeletons. The separate OCR stage was also gated off for tables in that config, so no stage produced the per-crop word boxes the existing join_table_structure_and_ocr_output helper needs, and the helper had no caller.

This PR reconnects them so the OCR stage now always runs on table crops. When use_table_structure=True it matches each crop's bbox against the page's regions and calls the existing join helper, falling back to pseudo-markdown on no match.

Before

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_1 page=1 idx=0 =====
|  |  |  |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
============================================================

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_2 page=2 idx=0 =====
|  |  |  |  |
| --- | --- | --- | --- |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
|  |  |  |  |
============================================================

After

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_1 page=1 idx=0 =====
| Animal | Activity | Place |
| --- | --- | --- |
| Giraffe | Driving a car | At the beach |
| Lion | Putting on sunscreen | At the park |
| Cat | Jumping onto a laptop | In a home office |
| Dog | Chasing a squirrel | In the front yard |
============================================================

===== table: src=/home/edwardk/git/nv-ingest/data/multimodal_test.pdf_2 page=2 idx=0 =====
smoke test to ensure extraction is working as intended. This will be used in Cl time that changes make the library do negatively impact on to ensure we to not
| Car | Color1 | Color2 | Color3 |
| --- | --- | --- | --- |
| Coupe | White | Silver | Flat Gray |
| Sedan | White | Metallic Gray | Matte Gray |
| Minivan | Gray | Beige | Black |
| Truck | Dark Gray | Titanium Gray | Charcoal |
| Convertible | Light Gray | Graphite | Slate Gray |
============================================================

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

…tent

greptile-apps · 2026-04-17T00:39:35Z

Greptile Summary

This PR reconnects the table-structure and OCR stages after a previous commit decoupled them and left use_table_structure=True pipelines producing empty markdown skeletons. The table-structure stage now publishes a table_structure_v1 column containing per-crop detections and crop dimensions; the OCR stage reads this column, matches each table crop by bbox, and calls the existing join_table_structure_and_ocr_output helper to produce populated markdown, with graceful pseudo-markdown fallback on no match. The over-aggressive _trim_non_table_edge_rows heuristic that was discarding legitimate header rows is also removed. Test coverage is solid — local path, remote path, and no-match fallback are all exercised.

Confidence Score: 5/5

Safe to merge — the regression is fully fixed, all three test paths pass, and the one P2 finding degrades gracefully via existing getattr guards.

All remaining findings are P2: the actor call error paths don't include table_structure_v1, but the OCR stage already handles missing columns with a safe fallback. The core logic — new inter-stage column, bbox matching, join/fallback — is correct and well-tested.

gpu_actor.py and cpu_actor.py error fallback paths (minor schema inconsistency, no functional impact)

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/table/shared.py	Adds `table_structure_v1` inter-stage column containing per-crop detections and crop dimensions; used by OCR stage for bbox-matching. Pass 3 now extracts `crop_hw` and `counts` once and shares them between `table_items` and the new `ts_regions` list.
nemo_retriever/src/nemo_retriever/ocr/shared.py	Adds `use_table_structure` parameter to `ocr_page_elements`; introduces `_find_ts_detections_for_bbox` to match table crop bboxes against `table_structure_v1` regions (1e-4 tolerance); joins structure detections with OCR output via `join_table_structure_and_ocr_output`, falling back to pseudo-markdown on no match.
nemo_retriever/src/nemo_retriever/utils/table_and_chart.py	Removes `_trim_non_table_edge_rows` (confirmed unused elsewhere) and its call-site — was over-aggressively trimming legitimate header/footer rows from extracted tables.
nemo_retriever/src/nemo_retriever/table/gpu_actor.py	Removes unused `ocr_invoke_url` parameter. The `__call__` error fallback doesn't include the new `table_structure_v1` column, creating schema inconsistency on rare catastrophic failure (functionally safe due to graceful OCR-stage fallback).
nemo_retriever/src/nemo_retriever/table/cpu_actor.py	Mirrors `gpu_actor.py`: removes `ocr_invoke_url`, same `table_structure_v1`-absent error path.
nemo_retriever/tests/test_table_structure.py	Adds `TestOCRJoinsTableStructure` with local, no-match fallback, and remote path tests. Updates existing tests to assert `table_structure_v1` column shape. Removes now-deleted `ocr_invoke_url` assertions from config tests.
nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py	OCR stage now always runs for table crops regardless of `use_table_structure`; flag forwarded into `ocr_kwargs` so the OCR actor can decide whether to join or fall back to pseudo-markdown.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py	Matches the `multi_type_extract_operator.py` change: removes `not use_table_structure` guard on `extract_tables` OCR kwarg and forwards `use_table_structure` into `ocr_kwargs`.
nemo_retriever/src/nemo_retriever/table/config.py	Removes `ocr_invoke_url` from `TableStructureOCRStageConfig` and its loader — OCR URL is no longer managed by the table-structure stage.
nemo_retriever/src/nemo_retriever/ocr/gpu_ocr.py	Adds `use_table_structure` kwarg normalization to match the existing pattern for other flags.
nemo_retriever/src/nemo_retriever/ocr/cpu_ocr.py	Same `use_table_structure` kwarg normalization as `gpu_ocr.py`.

Sequence Diagram

sequenceDiagram
    participant PE as PageElementDetectionActor
    participant TS as TableStructureActor
    participant OCR as OCRActor

    PE->>TS: batch_df (page_elements_v3)
    Note over TS: Detect cell/row/col structure per table crop
    TS->>OCR: batch_df + table (skeleton text) + table_structure_v1 regions

    loop for each table crop
        OCR->>OCR: _find_ts_detections_for_bbox(row, bbox)
        alt structure match found
            OCR->>OCR: join_table_structure_and_ocr_output(ts_dets, preds, crop_hw)
            Note over OCR: Populated markdown table
        else no match or empty detections
            OCR->>OCR: _blocks_to_pseudo_markdown(blocks)
            Note over OCR: Pseudo-markdown fallback
        end
    end
    OCR->>OCR: out[table] = joined results

Prompt To Fix All With AI

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/table/gpu_actor.py
Line: 86-87

Comment:
**Error path omits the new `table_structure_v1` column**

Both actor `__call__` error paths (this file and `cpu_actor.py`) were written before this PR added `table_structure_v1`. When the outer `BaseException` catch fires, the returned DataFrame has `table` and `table_structure_ocr_v1` but no `table_structure_v1`. The OCR stage handles missing columns safely via `getattr(row, "table_structure_v1", None)`, so the fallback to pseudo-markdown still works — but the schema is inconsistent with the success path and could confuse downstream consumers or schema-aware Ray Data stages.

The same fix applies to `cpu_actor.py` (same block at lines 85–86):

```python
out["table"] = [[] for _ in range(n)]
out["table_structure_v1"] = [
    {"regions": [], "timing": None, "error": payload["error"]} for _ in range(n)
]
out["table_structure_ocr_v1"] = [payload for _ in range(n)]
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (3): Last reviewed commit: "remove _trim_non_table_edge_rows" | Re-trigger Greptile}

jioffe502

Yes! I found this in my testing as well.

Thank you for this fix

Join table-structure detections with OCR to restore markdown cell con…

ae3ace1

…tent

edknv requested review from a team as code owners April 17, 2026 00:35

edknv requested a review from jdye64 April 17, 2026 00:35

greptile-apps bot reviewed Apr 17, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/table/gpu_actor.py Outdated

Comment thread nemo_retriever/tests/test_table_structure.py

edknv added 4 commits April 16, 2026 17:47

Remove ocr_invoke_url from table structure actors

89f4d64

remove more ocr_invoke_url from table-structure

bb3930d

add unit tests

a8f71e3

remove _trim_non_table_edge_rows

a4c966d

jioffe502 approved these changes Apr 17, 2026

View reviewed changes

jdye64 approved these changes Apr 17, 2026

View reviewed changes

jdye64 merged commit e97d7cc into NVIDIA:main Apr 17, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Join table-structure detections with OCR to restore markdown cell con…#1870

Join table-structure detections with OCR to restore markdown cell con…#1870
jdye64 merged 5 commits intoNVIDIA:mainfrom
edknv:edwardk/fix-table-structure

edknv commented Apr 17, 2026

Uh oh!

greptile-apps bot commented Apr 17, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

jioffe502 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

edknv commented Apr 17, 2026

Description

Checklist

Uh oh!

greptile-apps bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

jioffe502 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps bot commented Apr 17, 2026 •

edited

Loading