README.md: 4 additions & 1 deletion
@@ -40,6 +40,8 @@ print(doc.text)
 print(doc._.layout)
 # Tables in the document and their extracted data
 print(doc._.tables)
+# Markdown representation of the document
+print(doc._.markdown)
 
 # Layout spans for different sections
 for span in doc.spans["layout"]:
@@ -114,6 +116,7 @@ for span in doc.spans["layout"]:
 |`Doc._.layout`|`DocLayout`| Layout features of the document. |
 |`Doc._.pages`|`list[tuple[PageLayout, list[Span]]]`| Pages in the document and the spans they contain. |
 |`Doc._.tables`|`list[Span]`| All tables in the document. |
+|`Doc._.markdown`|`str`| Markdown representation of the document. |
 |`Doc.spans["layout"]`|`spacy.tokens.SpanGroup`| The layout spans in the document. |
 |`Span.label_`|`str`| The type of the extracted layout span, e.g. `"text"` or `"section_header"`. [See here](https://github.com/DS4SD/docling-core/blob/14cad33ae7f8dc011a79dd364361d2647c635466/docling_core/types/doc/labels.py) for options. |
 |`Span.label`|`int`| The integer ID of the span label. |
@@ -161,7 +164,7 @@ layout = spaCyLayout(nlp)
 | --- | --- | --- |
 |`nlp`|`spacy.language.Language`| The initialized `nlp` object to use for tokenization. |
 |`separator`|`str`| Token used to separate sections in the created `Doc` object. The separator won't be part of the layout span. If `None`, no separator will be added. Defaults to `"\n\n"`. |
-|`attrs`|`dict[str, str]`| Override the custom spaCy attributes. Can include `"doc_layout"`, `"doc_pages"`, `"doc_tables"`, `"span_layout"`, `"span_data"`, `"span_heading"` and `"span_group"`. |
+|`attrs`|`dict[str, str]`| Override the custom spaCy attributes. Can include `"doc_layout"`, `"doc_pages"`, `"doc_tables"`, `"doc_markdown"`, `"span_layout"`, `"span_data"`, `"span_heading"` and `"span_group"`. |
 |`headings`|`list[str]`| Labels of headings to consider for `Span._.heading` detection. Defaults to `["section_header", "page_header", "title"]`. |
 |`display_table`|`Callable[[pandas.DataFrame], str] \| str`| Function to generate the text-based representation of the table in the `Doc.text` or placeholder text. Defaults to `"TABLE"`. |
 |`docling_options`|`dict[InputFormat, FormatOption]`|[Format options](https://ds4sd.github.io/docling/usage/#advanced-options) passed to Docling's `DocumentConverter`. |
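
The `attrs` row now accepts a `"doc_markdown"` key alongside the existing ones. As a rough illustration of how such an override mapping might be resolved, here is a minimal, library-independent sketch. The `DEFAULT_ATTRS` table and the `resolve_attrs` helper are hypothetical names introduced only for this example; the default values are inferred from the attribute names shown in the README (`Doc._.layout`, `Doc._.markdown`, `doc.spans["layout"]`, and so on), and rejecting unknown keys is an assumption, not confirmed library behavior.

```python
from __future__ import annotations

# Hypothetical defaults, inferred from the attribute names the README
# documents; the library's real internals may differ.
DEFAULT_ATTRS = {
    "doc_layout": "layout",
    "doc_pages": "pages",
    "doc_tables": "tables",
    "doc_markdown": "markdown",  # the key added in this diff
    "span_layout": "layout",
    "span_data": "data",
    "span_heading": "heading",
    "span_group": "layout",
}


def resolve_attrs(overrides: dict[str, str] | None = None) -> dict[str, str]:
    """Merge user overrides into the defaults (hypothetical helper).

    Unknown keys raise a ValueError; this strictness is an assumption
    made for the sketch.
    """
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DEFAULT_ATTRS))
    if unknown:
        raise ValueError(f"Unknown attrs keys: {unknown}")
    # Later entries win, so overrides replace the matching defaults.
    return {**DEFAULT_ATTRS, **overrides}


attrs = resolve_attrs({"doc_markdown": "md"})
print(attrs["doc_markdown"])  # md
print(attrs["doc_tables"])    # tables
```

Under these assumptions, passing `attrs={"doc_markdown": "md"}` would expose the Markdown string under a different extension name while leaving the other attribute names at their defaults.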