Commit 1df0560
fix: Clear word/char cells when force_full_page_ocr is used (#2738)
* fix: Clear word/char cells when force_full_page_ocr is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells were not cleared, causing downstream components like
TableStructureModel to use unreliable PDF-extracted text containing
GLYPH artifacts (e.g., GLYPH<c=1,font=/AAAAAH+font000000002ed64673>).

This fix clears word_cells and char_cells when force_full_page_ocr
is enabled, ensuring TableStructureModel falls back to the
OCR-extracted textline cells via its existing fallback logic.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.

* fix: Filter out PDF-extracted word/char cells when force_full_page_ocr is used

When force_full_page_ocr=True, the OCR model correctly replaces
textline_cells with OCR-extracted text. However, word_cells and
char_cells from the PDF backend were not handled, causing downstream
components like TableStructureModel to use unreliable PDF-extracted
text containing GLYPH artifacts.

Instead of clearing all word/char cells (which would be destructive
for backends like mets_gbs that provide OCR-generated word cells),
this fix filters out only cells where from_ocr=False, preserving any
OCR-generated cells.

This ensures TableStructureModel falls back to the OCR-extracted
textline cells via its existing fallback logic when word_cells is
empty or only contains OCR cells.

Fixes issue where PDFs with problematic fonts (Type3, missing
ToUnicode CMap) produced GLYPH artifacts in table content despite
force_full_page_ocr being triggered.

* DCO Remediation Commit for Myles McNamara <[email protected]>
I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: 4197a4e
I, Myles McNamara <[email protected]>, hereby add my Signed-off-by to this commit: a4f4e3f
Signed-off-by: Myles McNamara <[email protected]>
---------
Signed-off-by: Myles McNamara <[email protected]>

1 parent edbabfc commit 1df0560
1 file changed: +14 -0 lines changed
Lines 157-170 added after original line 156 (original lines 157-159 become lines 171-173).
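The filtering described in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code added by this commit: the `Cell` and `ParsedPage` classes and the `drop_pdf_cells` and `cells_for_table` helpers are hypothetical stand-ins, and the fallback shown for table processing is an assumption based on the description above. Only the `from_ocr`, `textline_cells`, `word_cells`, and `char_cells` names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell on a page."""
    text: str
    from_ocr: bool = False


@dataclass
class ParsedPage:
    """Hypothetical stand-in for a page with three cell granularities."""
    textline_cells: List[Cell] = field(default_factory=list)
    word_cells: List[Cell] = field(default_factory=list)
    char_cells: List[Cell] = field(default_factory=list)


def drop_pdf_cells(page: ParsedPage) -> None:
    """Keep only OCR-generated word/char cells (from_ocr=True)."""
    page.word_cells = [c for c in page.word_cells if c.from_ocr]
    page.char_cells = [c for c in page.char_cells if c.from_ocr]


def cells_for_table(page: ParsedPage) -> List[Cell]:
    """Assumed fallback: use word cells if any exist, else textline cells."""
    return page.word_cells if page.word_cells else page.textline_cells


# Page from a PDF backend: word/char cells carry GLYPH artifacts.
pdf_page = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell("GLYPH<c=1,font=/F1>"), Cell("GLYPH<c=2,font=/F1>")],
    char_cells=[Cell("GLYPH<c=1,font=/F1>")],
)
drop_pdf_cells(pdf_page)
pdf_table_text = [c.text for c in cells_for_table(pdf_page)]

# Page from a backend that supplies OCR-generated word cells.
ocr_page = ParsedPage(
    textline_cells=[Cell("Total 42", from_ocr=True)],
    word_cells=[Cell("Total", from_ocr=True), Cell("42", from_ocr=True)],
)
drop_pdf_cells(ocr_page)
ocr_table_text = [c.text for c in cells_for_table(ocr_page)]
```

In this sketch the PDF-backed page ends up with no word cells, so the table step reads the OCR textline cells, while the page whose word cells already come from OCR keeps them. Clearing both lists unconditionally, as in the first version of the fix, would have discarded the second page's word cells as well.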