Releases · aws-samples/amazon-textract-textractor

Table linearization improvements by @Belval in #313
- Add .get_text(), .to_html() and .to_markdown() functions to Linearizable, which is now implemented by Document, Page, DocumentEntity and EntityList
- Add HTMLLinearizationConfig and MarkdownLinearizationConfig as pre-configured TextLinearizationConfig
- Add the following parameters to TextLinearizationConfig:
  - duplicate_text_in_merged_cells duplicates the text in merged cells to preserve row-level alignment
  - table_flatten_headers combines multi-row headers into a single row, duplicating the merged cells horizontally as needed
  - table_tabulate_remove_extra_hyphens removes extra hyphens '-' in markdown tables to reduce context length
  - max_number_of_consecutive_spaces defines the maximum number of contiguous whitespace characters, similar to max_number_of_consecutive_new_lines
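As a rough illustration of what two of these options describe, the sketch below collapses long runs of spaces and shortens markdown separator rows in plain Python. It is not Textractor's implementation: the helper names `limit_consecutive_spaces` and `shorten_markdown_separators` are hypothetical, and the first helper only handles the space character rather than all whitespace.

```python
import re


def limit_consecutive_spaces(text: str, max_spaces: int = 1) -> str:
    """Collapse any run of spaces longer than max_spaces down to max_spaces.

    Hypothetical helper, loosely mirroring the idea behind
    max_number_of_consecutive_spaces.
    """
    pattern = re.compile(" {%d,}" % (max_spaces + 1))
    return pattern.sub(" " * max_spaces, text)


def shorten_markdown_separators(table: str) -> str:
    """Shrink every run of four or more hyphens in separator rows to three.

    Hypothetical helper, loosely mirroring the idea behind
    table_tabulate_remove_extra_hyphens.
    """
    lines = []
    for line in table.splitlines():
        stripped = line.strip()
        # A separator row contains only pipes, hyphens, colons and spaces.
        if stripped and "-" in stripped and set(stripped) <= set("|-: "):
            line = re.sub(r"-{4,}", "---", line)
        lines.append(line)
    return "\n".join(lines)


padded_table = "| a | b |\n|--------|-----|\n| 1 | 2 |"
print(limit_consecutive_spaces("a    b  c", max_spaces=2))
print(shorten_markdown_separators(padded_table))
```

Shorter separator rows carry the same table structure in fewer characters, which is the context-length saving the option refers to.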
Fixes:
- Fix trailing whitespace in cell text
- Fix table_column_separator being hardcoded as '\t'
- Fix table_row_separator being hardcoded as '\n'
- Reset the BytesIO buffer to position 0 by @abest0 in #310
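The release notes do not say where in Textractor the buffer is used, so the following is only a generic standard-library sketch of why a BytesIO buffer has to be rewound before it is read. The payload bytes are made up for the example.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"example document bytes")  # made-up payload

# After writing, the stream position sits at the end of the data,
# so a read from here returns nothing.
read_without_reset = buffer.read()

# Rewinding to position 0 makes the full contents readable again.
buffer.seek(0)
read_after_reset = buffer.read()

print(read_without_reset)
print(read_after_reset)
```

Any consumer that reads from the current position, rather than calling `getvalue()`, sees an empty stream unless the position is reset first.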

New Contributors

@abest0 made their first contribution in #310

Full Changelog: v1.7.2...v1.7.3

Contributors

abest0 and Belval


09 Feb 20:22

Belval

v1.7.2

3398444

Version 1.7.2

What's Changed

Fix for page objects not always having an image attached, causing an exception on .visualize()

Full Changelog: v1.7.1...v1.7.2


31 Jan 21:33

Belval

v1.7.1

0a36560

Version 1.7.1

What's Changed

Fix issue where a table within a container layout could be duplicated in the .get_text() output.

Full Changelog: v1.7.0...v1.7.1


31 Jan 00:13

Belval

v1.7.0

1d72c89

Version 1.7.0

What's Changed

Loosen XlsxWriter version constraints by @mdscruggs in #292
Rework the linearization heuristic to ensure that no words are missing or duplicated
Fix KeyValues being assigned twice on overlapping table cells. Going forward, KeyValues inside a table are ignored (table structure takes precedence)
Harden parser code against missing children in layouts or KeyValues with missing keys
Fix markdown tables not having header rows when one of the cells is empty
Add support for Python 3.11 and 3.12 in the GitHub Actions workflows
Add textractor.__version__ to allow easier identification of the installed Textractor version in code
Add hide_table_layout
Remove amazon-textract-response-parser as a dependency, as its use for validating the input schema could add +200 ms of latency in some cases. Textractor-only parsing takes <30 ms.
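One way the new `textractor.__version__` attribute might be used is to gate features on the installed release. The sketch below assumes the attribute holds a plain `X.Y.Z` string; the `version_tuple` helper and the `has_html_export` flag are hypothetical names introduced for illustration, and the code falls back gracefully when the package is absent or older than 1.7.0.

```python
def version_tuple(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison.

    Assumes there are no pre-release suffixes such as 'rc1'.
    """
    return tuple(int(part) for part in version.split("."))


try:
    import textractor  # requires amazon-textract-textractor to be installed

    # Releases before 1.7.0 do not define __version__, hence the default.
    installed = getattr(textractor, "__version__", None)
except ImportError:
    installed = None

# .to_html() and .to_markdown() are listed in the 1.7.3 notes.
has_html_export = installed is not None and version_tuple(installed) >= (1, 7, 3)

if installed is None:
    print("Textractor is not installed, or predates 1.7.0")
else:
    print("Textractor", installed, "HTML export available:", has_html_export)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "1.10.0" sorting before "1.7.3".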

Breaking changes

Remove linearize_table and linearize_key_value from TextLinearizationConfig as neither was used
Remove the s3_output_path parameter from analyze_expense as the API does not support outputting to S3

New Contributors

@mdscruggs made their first contribution in #292

Full Changelog: v1.6.1...v1.7.0

Contributors

mdscruggs


19 Dec 21:03

Belval

v1.6.1

884ddaa

Version 1.6.1

What's new

Fix bug in table to markdown

Full Changelog: v1.6.0...v1.6.1
