Join table-structure detections with OCR to restore markdown cell con…#1870
Join table-structure detections with OCR to restore markdown cell con…#1870jdye64 merged 5 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR reconnects the table-structure and OCR stages after a previous commit decoupled them and left
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/table/shared.py | Adds table_structure_v1 inter-stage column containing per-crop detections and crop dimensions; used by OCR stage for bbox-matching. Pass 3 now extracts crop_hw and counts once and shares them between table_items and the new ts_regions list. |
| nemo_retriever/src/nemo_retriever/ocr/shared.py | Adds use_table_structure parameter to ocr_page_elements; introduces _find_ts_detections_for_bbox to match table crop bboxes against table_structure_v1 regions (1e-4 tolerance); joins structure detections with OCR output via join_table_structure_and_ocr_output, falling back to pseudo-markdown on no match. |
| nemo_retriever/src/nemo_retriever/utils/table_and_chart.py | Removes _trim_non_table_edge_rows (confirmed unused elsewhere) and its call-site — was over-aggressively trimming legitimate header/footer rows from extracted tables. |
| nemo_retriever/src/nemo_retriever/table/gpu_actor.py | Removes unused ocr_invoke_url parameter. The __call__ error fallback doesn't include the new table_structure_v1 column, creating schema inconsistency on rare catastrophic failure (functionally safe due to graceful OCR-stage fallback). |
| nemo_retriever/src/nemo_retriever/table/cpu_actor.py | Mirrors gpu_actor.py: removes ocr_invoke_url, same table_structure_v1-absent error path. |
| nemo_retriever/tests/test_table_structure.py | Adds TestOCRJoinsTableStructure with local, no-match fallback, and remote path tests. Updates existing tests to assert table_structure_v1 column shape. Removes now-deleted ocr_invoke_url assertions from config tests. |
| nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py | OCR stage now always runs for table crops regardless of use_table_structure; flag forwarded into ocr_kwargs so the OCR actor can decide whether to join or fall back to pseudo-markdown. |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | Matches the multi_type_extract_operator.py change: removes not use_table_structure guard on extract_tables OCR kwarg and forwards use_table_structure into ocr_kwargs. |
| nemo_retriever/src/nemo_retriever/table/config.py | Removes ocr_invoke_url from TableStructureOCRStageConfig and its loader — OCR URL is no longer managed by the table-structure stage. |
| nemo_retriever/src/nemo_retriever/ocr/gpu_ocr.py | Adds use_table_structure kwarg normalization to match the existing pattern for other flags. |
| nemo_retriever/src/nemo_retriever/ocr/cpu_ocr.py | Same use_table_structure kwarg normalization as gpu_ocr.py. |
Sequence Diagram
sequenceDiagram
participant PE as PageElementDetectionActor
participant TS as TableStructureActor
participant OCR as OCRActor
PE->>TS: batch_df (page_elements_v3)
Note over TS: Detect cell/row/col structure per table crop
TS->>OCR: batch_df + table (skeleton text) + table_structure_v1 regions
loop for each table crop
OCR->>OCR: _find_ts_detections_for_bbox(row, bbox)
alt structure match found
OCR->>OCR: join_table_structure_and_ocr_output(ts_dets, preds, crop_hw)
Note over OCR: Populated markdown table
else no match or empty detections
OCR->>OCR: _blocks_to_pseudo_markdown(blocks)
Note over OCR: Pseudo-markdown fallback
end
end
OCR->>OCR: out[table] = joined results
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/table/gpu_actor.py
Line: 86-87
Comment:
**Error path omits the new `table_structure_v1` column**
Both actor `__call__` error paths (this file and `cpu_actor.py`) were written before this PR added `table_structure_v1`. When the outer `BaseException` catch fires, the returned DataFrame has `table` and `table_structure_ocr_v1` but no `table_structure_v1`. The OCR stage handles missing columns safely via `getattr(row, "table_structure_v1", None)`, so the fallback to pseudo-markdown still works — but the schema is inconsistent with the success path and could confuse downstream consumers or schema-aware Ray Data stages.
The same fix applies to `cpu_actor.py` (same block at lines 85–86):
```python
out["table"] = [[] for _ in range(n)]
out["table_structure_v1"] = [
{"regions": [], "timing": None, "error": payload["error"]} for _ in range(n)
]
out["table_structure_ocr_v1"] = [payload for _ in range(n)]
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (3): Last reviewed commit: "remove _trim_non_table_edge_rows" | Re-trigger Greptile
jioffe502
left a comment
There was a problem hiding this comment.
Yes! I found this in my testing as well.
Thank you for this fix
…tent
Description
A previous commit correctly removed OCR from the table-structure stage to avoid running it twice, but left
use_table_structure=Truepipelines producing empty markdown skeletons. The separate OCR stage was also gated off for tables in that config, so no stage produced the per-crop word boxes the existingjoin_table_structure_and_ocr_outputhelper needs, and the helper had no caller.This PR reconnects them so the OCR stage now always runs on table crops. When
use_table_structure=Trueit matches each crop's bbox against the page's regions and calls the existing join helper, falling back to pseudo-markdown on no match.Before
After
Checklist