Commit 1df0560

fix: Clear word/char cells when force_full_page_ocr is used (#2738)
* fix: Clear word/char cells when force_full_page_ocr is used

  When force_full_page_ocr=True, the OCR model correctly replaces textline_cells with OCR-extracted text. However, word_cells and char_cells were not cleared, causing downstream components like TableStructureModel to use unreliable PDF-extracted text containing GLYPH artifacts (e.g., GLYPH<c=1,font=/AAAAAH+font000000002ed64673>).

  This fix clears word_cells and char_cells when force_full_page_ocr is enabled, ensuring TableStructureModel falls back to the OCR-extracted textline cells via its existing fallback logic.

  Fixes issue where PDFs with problematic fonts (Type3, missing ToUnicode CMap) produced GLYPH artifacts in table content despite force_full_page_ocr being triggered.

* fix: Filter out PDF-extracted word/char cells when force_full_page_ocr is used

  When force_full_page_ocr=True, the OCR model correctly replaces textline_cells with OCR-extracted text. However, word_cells and char_cells from the PDF backend were not handled, causing downstream components like TableStructureModel to use unreliable PDF-extracted text containing GLYPH artifacts.

  Instead of clearing all word/char cells (which would be destructive for backends like mets_gbs that provide OCR-generated word cells), this fix filters out only cells where from_ocr=False, preserving any OCR-generated cells. This ensures TableStructureModel falls back to the OCR-extracted textline cells via its existing fallback logic when word_cells is empty or only contains OCR cells.

  Fixes issue where PDFs with problematic fonts (Type3, missing ToUnicode CMap) produced GLYPH artifacts in table content despite force_full_page_ocr being triggered.

* DCO Remediation Commit for Myles McNamara <[email protected]>

  I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: 4197a4e
  I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: a4f4e3f

  Signed-off-by: Myles McNamara <[email protected]>

---------

Signed-off-by: Myles McNamara <[email protected]>
1 parent edbabfc commit 1df0560

1 file changed: +14 -0 lines changed

docling/models/base_ocr_model.py

Lines changed: 14 additions & 0 deletions
@@ -154,6 +154,20 @@ def post_process_cells(self, ocr_cells: List[TextCell], page: Page) -> None:
         page.parsed_page.textline_cells = final_cells
         page.parsed_page.has_lines = len(final_cells) > 0
 
+        # When force_full_page_ocr is used, PDF-extracted word/char cells are
+        # unreliable. Filter out cells where from_ocr=False, keeping any OCR-
+        # generated cells. This ensures downstream components (e.g., table
+        # structure model) fall back to OCR-extracted textline cells.
+        if self.options.force_full_page_ocr:
+            page.parsed_page.word_cells = [
+                c for c in page.parsed_page.word_cells if c.from_ocr
+            ]
+            page.parsed_page.char_cells = [
+                c for c in page.parsed_page.char_cells if c.from_ocr
+            ]
+            page.parsed_page.has_words = len(page.parsed_page.word_cells) > 0
+            page.parsed_page.has_chars = len(page.parsed_page.char_cells) > 0
+
     def _combine_cells(
         self, existing_cells: List[TextCell], ocr_cells: List[TextCell]
     ) -> List[TextCell]:
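
To illustrate the effect of the filter in the diff above, here is a minimal, self-contained sketch. It does not import docling: `Cell`, `ParsedPage`, `drop_pdf_cells`, and `cells_for_table` are hypothetical stand-ins that carry only the fields the diff touches. In particular, `cells_for_table` is an assumption about what "existing fallback logic" might look like (prefer word cells, otherwise use textline cells), not the actual TableStructureModel code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; only the fields used here."""
    text: str
    from_ocr: bool


@dataclass
class ParsedPage:
    """Hypothetical stand-in for the parsed page the diff mutates."""
    textline_cells: List[Cell] = field(default_factory=list)
    word_cells: List[Cell] = field(default_factory=list)
    char_cells: List[Cell] = field(default_factory=list)
    has_words: bool = False
    has_chars: bool = False


def drop_pdf_cells(parsed: ParsedPage) -> None:
    """Keep only OCR-generated word/char cells, mirroring the filter in the diff."""
    parsed.word_cells = [c for c in parsed.word_cells if c.from_ocr]
    parsed.char_cells = [c for c in parsed.char_cells if c.from_ocr]
    parsed.has_words = len(parsed.word_cells) > 0
    parsed.has_chars = len(parsed.char_cells) > 0


def cells_for_table(parsed: ParsedPage) -> List[Cell]:
    """Assumed fallback: use word cells if any remain, else textline cells."""
    return parsed.word_cells if parsed.word_cells else parsed.textline_cells


GLYPH = "GLYPH<c=1,font=/AAAAAH+font000000002ed64673>"

# Case 1: the backend supplied only PDF-extracted word/char cells.
pdf_only = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell(GLYPH, from_ocr=False)],
    char_cells=[Cell(GLYPH, from_ocr=False)],
    has_words=True,
    has_chars=True,
)
drop_pdf_cells(pdf_only)

# Case 2: the backend also supplied OCR-generated word cells, which must survive.
mixed = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell(GLYPH, from_ocr=False), Cell("Total", from_ocr=True)],
    has_words=True,
)
drop_pdf_cells(mixed)
```

In the first case the word list ends up empty, so the assumed fallback returns the OCR textline cells. In the second case the OCR-generated word cell is kept, which is the behavior the second commit message describes as the reason for filtering rather than clearing.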
