Releases: aws-samples/amazon-textract-textractor
Version 1.7.8
Version 1.7.7
What's Changed
Full Changelog: v1.7.6...v1.7.7
Version 1.7.6
Version 1.7.5
What's Changed
- Make KeyValue.key an EntityList by @Belval in #320
- Remove numpy from explicit dependencies by @Belval in #324
- Hide key value layouts by @Belval in #325
- Return query and query answer with get_text() by @Belval in #329
- Convert image to RGB in EntityList for Jupyter compatibility by @Belval in #330
- Support for Python 3.12 by @tb102122 in #311
Full Changelog: v1.7.4...v1.7.5
Version 1.7.4
What's Changed
- Fix table title .get_text() by @Belval in #314
- Fix .to_pandas() raising an exception by @Belval in #315
Full Changelog: v1.7.3...v1.7.4
Version 1.7.3
What's Changed
- Table linearization improvements by @Belval in #313
  - Add `.get_text()`, `.to_html()` and `.to_markdown()` functions to `Linearizable`, which is now implemented by `Document`, `Page`, `DocumentEntity` and `EntityList`
  - Add `HTMLLinearizationConfig` and `MarkdownLinearizationConfig` as pre-configured `TextLinearizationConfig`
  - Add the following parameters to `TextLinearizationConfig`:
    - `duplicate_text_in_merged_cells`: duplicates the text in merged cells to preserve row-level alignment
    - `table_flatten_headers`: combines multi-row headers into a single row, duplicating the merged cells horizontally as needed
    - `table_tabulate_remove_extra_hyphens`: removes extra hyphens '-' in markdown tables to reduce context length
    - `max_number_of_consecutive_spaces`: defines the maximum number of contiguous whitespace characters, similar to `max_number_of_consecutive_new_lines`
- Fixes:
New Contributors
Full Changelog: v1.7.2...v1.7.3
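The four `TextLinearizationConfig` parameters introduced in 1.7.3 describe text transformations that can be illustrated without the library. The sketch below is a plain-Python approximation of what each option is described as doing. It is not Textractor's implementation: every function name is hypothetical, merged cells are assumed to be represented as `None`, and only space characters (not all whitespace) are collapsed.

```python
import re


def collapse_spaces(text, max_spaces=1):
    """Cap runs of spaces at max_spaces (cf. max_number_of_consecutive_spaces)."""
    pattern = r" {%d,}" % (max_spaces + 1)
    return re.sub(pattern, " " * max_spaces, text)


def trim_markdown_hyphens(table):
    """Shorten long hyphen runs to '---' (cf. table_tabulate_remove_extra_hyphens).

    Assumption: three hyphens are kept per separator cell; the library
    may choose a different minimum.
    """
    return re.sub(r"-{4,}", "---", table)


def duplicate_merged_cells(rows):
    """Fill None placeholders with the value to their left
    (cf. duplicate_text_in_merged_cells)."""
    filled = []
    for row in rows:
        out, last = [], ""
        for cell in row:
            if cell is None:  # hypothetical marker for a merged cell
                cell = last
            out.append(cell)
            last = cell
        filled.append(out)
    return filled


def flatten_headers(header_rows):
    """Combine multi-row headers into one row, column by column
    (cf. table_flatten_headers)."""
    filled = duplicate_merged_cells(header_rows)
    return [" ".join(part for part in column if part) for column in zip(*filled)]


if __name__ == "__main__":
    print(collapse_spaces("Total    due:   42"))
    print(trim_markdown_hyphens("| a | b |\n|--------|-----|\n| 1 | 2 |"))
    print(flatten_headers([["2023", None, "2024"], ["Q1", "Q2", "Q1"]]))
```

Shorter separator rows and collapsed spaces matter mostly when the linearized text is passed to a model with a limited context window, which is the motivation the release notes give for these options.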
Version 1.7.2
What's Changed
- Fix for page objects not always having an image attached, causing an exception on `.visualize()`
Full Changelog: v1.7.1...v1.7.2
Version 1.7.1
What's Changed
- Fix issue where a table within a container layout could be duplicated in the `.get_text()` output.
Full Changelog: v1.7.0...v1.7.1
Version 1.7.0
What's Changed
- Loosen XlsxWriter version constraints by @mdscruggs in #292
- Rework the linearization heuristic to ensure that no words are missing or duplicated
- Fix KeyValues being assigned twice on overlapping table cells. Going forward, KVs inside tables are ignored (table structure takes precedence)
- Harden parser code against missing children in layouts or KeyValues with missing keys
- Fix markdown tables not having header rows when one of the cells is empty
- Add support for Python 3.11 and 3.12 in the GitHub action workflows
- Add `textractor.__version__` to allow easier identification of the installed Textractor version in code
- Add `hide_table_layout`
- Remove amazon-textract-response-parser as a dependency, as its use for validating the input schema could add +200 ms of latency in some cases. Textractor-only parsing takes <30 ms.
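Since `textractor.__version__` only exists from 1.7.0 onward, code that must run against older installs may want to read it defensively. The following is a minimal sketch under that assumption. The helper names are hypothetical, and the parser only handles plain numeric versions such as `1.7.3` (not pre-release suffixes).

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.7.3' into (1, 7, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:3])


def installed_textractor_version() -> Optional[str]:
    """Return textractor.__version__, or None if it cannot be determined.

    None covers both cases: the package is not installed, or it predates
    1.7.0 and has no __version__ attribute.
    """
    try:
        import textractor
    except ImportError:
        return None
    return getattr(textractor, "__version__", None)


def is_at_least(version: Optional[str], minimum: Tuple[int, ...]) -> bool:
    """True when a known version is greater than or equal to minimum."""
    return version is not None and parse_version(version) >= minimum


if __name__ == "__main__":
    found = installed_textractor_version()
    print("textractor version:", found)
    print("1.7.0 or newer:", is_at_least(found, (1, 7, 0)))
```

Comparing tuples rather than strings avoids the usual pitfall where `"1.7.10" < "1.7.8"` lexicographically.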
Breaking changes
- Remove `linearize_table` and `linearize_key_value` from `TextLinearizationConfig`, as neither was used
- Remove the `s3_output_path` parameter from `analyze_expense`, as the API does not support outputting to S3
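Projects that build their linearization options from a stored dictionary may still carry the two removed keys. One way to migrate, sketched here as an assumption rather than an official upgrade path, is to strip them before constructing the config. The helper and constant names are hypothetical.

```python
# Options removed from TextLinearizationConfig in 1.7.0, per the notes above.
REMOVED_IN_1_7_0 = {"linearize_table", "linearize_key_value"}


def drop_removed_options(options):
    """Split stored options into those still accepted and those removed.

    Returns (kept, dropped): kept is a new dict without the removed keys,
    dropped is a sorted list of the keys that were filtered out.
    """
    kept = {key: value for key, value in options.items()
            if key not in REMOVED_IN_1_7_0}
    dropped = sorted(set(options) - set(kept))
    return kept, dropped


if __name__ == "__main__":
    stored = {"linearize_table": True, "max_number_of_consecutive_new_lines": 2}
    kept, dropped = drop_removed_options(stored)
    print("pass to the config:", kept)
    print("no longer supported:", dropped)
```

Returning the dropped keys makes it easy to log them once during the upgrade instead of silently discarding configuration.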
New Contributors
- @mdscruggs made their first contribution in #292
Full Changelog: v1.6.1...v1.7.0