Releases: aws-samples/amazon-textract-textractor

Version 1.7.8

21 Mar 12:21

What's Changed

  • Handle None Relationships when parsing LAYOUT_FIGURE

Full Changelog: v1.7.7...v1.7.8

Version 1.7.7

20 Mar 16:45

What's Changed

  • Handle None bounding box when parsing Queries by @Belval in #340

Full Changelog: v1.7.6...v1.7.7

Version 1.7.6

15 Mar 20:39

What's Changed

Full Changelog: v1.7.5...v1.7.6

Version 1.7.5

07 Mar 23:34

What's Changed

  • Make KeyValue.key an EntityList by @Belval in #320
  • Remove numpy from explicit dependencies by @Belval in #324
  • Hide key value layouts by @Belval in #325
  • Return query and query answer with get_text() by @Belval in #329
  • Convert image to RGB in EntityList for Jupyter compatibility by @Belval in #330
  • Support for Python 3.12 by @tb102122 in #311

Full Changelog: v1.7.4...v1.7.5

Version 1.7.4

26 Feb 15:46

What's Changed

Full Changelog: v1.7.3...v1.7.4

Version 1.7.3

26 Feb 12:39
c5120b0

What's Changed

  • Table linearization improvements by @Belval in #313

    • Add .get_text(), .to_html() and .to_markdown() functions to Linearizable, which is now implemented by Document, Page, DocumentEntity and EntityList
    • Add HTMLLinearizationConfig and MarkdownLinearizationConfig as pre-configured TextLinearizationConfig
    • Add the following parameters to TextLinearizationConfig:
      • duplicate_text_in_merged_cells duplicates the text in merged cells to preserve row-level alignment
      • table_flatten_headers combines multi-row headers into a single row, duplicating the merged cells horizontally as needed
      • table_tabulate_remove_extra_hyphens removes extra hyphens '-' in markdown tables to reduce context length
      • max_number_of_consecutive_spaces defines the maximum number of contiguous whitespace characters, similar to max_number_of_consecutive_new_lines
  • Fixes:

    • Fix trailing whitespace in cell text
    • Fix table_column_separator being hardcoded as '\t'
    • Fix table_row_separator being hardcoded as '\n'
    • Reset BytesIO buffer to position 0 by @abest0 in #310
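
To make the table options above more concrete, here is a minimal pure-Python sketch of the behavior that the descriptions of duplicate_text_in_merged_cells and table_flatten_headers suggest. It does not call Textractor. The cell tuple layout and the helper names expand_merged_cells and flatten_headers are assumptions made for illustration, not the library's implementation.

```python
from typing import List, Tuple

# (row, col, row_span, col_span, text): a hypothetical cell layout for this sketch
Cell = Tuple[int, int, int, int, str]


def expand_merged_cells(cells: List[Cell], n_rows: int, n_cols: int) -> List[List[str]]:
    """Build a grid where a merged cell's text is copied into every position it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, row_span, col_span, text in cells:
        for r in range(row, row + row_span):
            for c in range(col, col + col_span):
                grid[r][c] = text
    return grid


def flatten_headers(grid: List[List[str]], n_header_rows: int) -> List[List[str]]:
    """Join the header rows column by column so the table has a single header row."""
    header = []
    for col in range(len(grid[0])):
        parts: List[str] = []
        for r in range(n_header_rows):
            text = grid[r][col]
            # Skip empty cells and text repeated by a vertically merged cell.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        header.append(" ".join(parts))
    return [header] + grid[n_header_rows:]


cells = [
    (0, 0, 2, 1, "Region"),  # spans both header rows
    (0, 1, 1, 2, "Sales"),   # spans two columns
    (1, 1, 1, 1, "Q1"),
    (1, 2, 1, 1, "Q2"),
    (2, 0, 1, 1, "EMEA"),
    (2, 1, 1, 1, "10"),
    (2, 2, 1, 1, "12"),
]
grid = expand_merged_cells(cells, n_rows=3, n_cols=3)
flat = flatten_headers(grid, n_header_rows=2)
for row in flat:
    print(" | ".join(row))
```

Duplicating the merged "Sales" cell across both columns is what lets each flattened header keep its parent label, so every data row lines up with a complete column name.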

New Contributors

Full Changelog: v1.7.2...v1.7.3

Version 1.7.2

09 Feb 20:22

What's Changed

  • Fix for page objects not always having an image attached, causing an exception on .visualize()

Full Changelog: v1.7.1...v1.7.2

Version 1.7.1

31 Jan 21:33

What's Changed

  • Fix issue where a table within a container layout could be duplicated in the .get_text() output.

Full Changelog: v1.7.0...v1.7.1

Version 1.7.0

31 Jan 00:13

What's Changed

  • Loosen XlsxWriter version constraints by @mdscruggs in #292
  • Rework the linearization heuristic to ensure that no words are missing or duplicated
  • Fix KeyValues being assigned twice on overlapping table cells. Going forward, KVs inside a table are ignored (table structure takes precedence)
  • Harden parser code against missing children in layouts or KeyValues with missing keys
  • Fix markdown tables not having header rows when one of the cells is empty
  • Add support for Python 3.11 and 3.12 in the GitHub action workflows
  • Add textractor.__version__ to allow easier identification of the installed Textractor version in code
  • Add hide_table_layout
  • Remove amazon-textract-response-parser as a dependency, as its use for validating the input schema could add +200 ms of latency in some cases. Textractor-only parsing takes <30 ms.
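
As a rough illustration of how textractor.__version__ might be used, the sketch below reads the attribute defensively and compares it numerically. The helper names installed_textractor_version and version_tuple are hypothetical, and the code assumes the version string is purely numeric and dot-separated (for example "1.7.8").

```python
from typing import Optional, Tuple


def installed_textractor_version() -> Optional[str]:
    """Return textractor.__version__, or None if the package or the attribute is missing."""
    try:
        import textractor
    except ImportError:
        return None
    # Releases before 1.7.0 do not define __version__, hence the getattr fallback.
    return getattr(textractor, "__version__", None)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.7.8' into (1, 7, 8) so versions compare numerically.

    Assumes a plain numeric version string; suffixes such as '.dev0' are not handled.
    """
    return tuple(int(part) for part in version.split("."))


version = installed_textractor_version()
if version is None:
    print("Textractor is not installed, or is older than 1.7.0")
elif version_tuple(version) >= (1, 7, 0):
    print(f"Textractor {version} detected")
```

Comparing tuples rather than raw strings avoids the case where "1.10.0" would sort before "1.7.8" lexicographically.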

Breaking changes

  • Remove linearize_table and linearize_key_value from TextLinearizationConfig as neither was used
  • Remove the s3_output_path parameter from analyze_expense as the API does not support outputting to S3
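
For code that builds its linearization options as a dictionary, one possible way to adapt to the removed TextLinearizationConfig fields is to filter them out before constructing the config. This is only a sketch: drop_removed_options and REMOVED_OPTIONS are hypothetical names, and passing the result as keyword arguments to TextLinearizationConfig is an assumption about how the calling code is written.

```python
from typing import Any, Dict

# Fields removed from TextLinearizationConfig in 1.7.0, per the notes above.
REMOVED_OPTIONS = {"linearize_table", "linearize_key_value"}


def drop_removed_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the options without the fields removed in 1.7.0."""
    return {key: value for key, value in options.items() if key not in REMOVED_OPTIONS}


old_options = {
    "linearize_table": True,                   # removed in 1.7.0
    "max_number_of_consecutive_new_lines": 2,  # kept as-is
}
new_options = drop_removed_options(old_options)
print(new_options)
# Assumed usage in calling code: TextLinearizationConfig(**new_options)
```

Since the notes say neither field was used, dropping them should not change the linearized output.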

New Contributors

Full Changelog: v1.6.1...v1.7.0

Version 1.6.1

19 Dec 21:03

What's new

  • Fix bug in table to markdown

Full Changelog: v1.6.0...v1.6.1