Commit

remove textract references
tfeldmann committed Feb 17, 2024
1 parent f16e4ca commit 19dc073
Showing 5 changed files with 9 additions and 16 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
@@ -39,9 +39,9 @@ jobs:
PYTHON_KEYRING_BACKEND: keyring.backends.null.Keyring
run: |
python3 -m pip install -U pip setuptools
python3 -m pip install poetry==1.7.1 lxml
python3 -m pip install poetry==1.7.1
poetry config virtualenvs.create false
poetry install --with=dev --extras=textract
poetry install --with=dev
- name: Version info
run: |
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -3,7 +3,7 @@
## [Unreleased]

- Integrated `pdftotext`, `pdfminer` and `docx2txt` interfaces into `filecontent` filter.
- Removed `textract` and many other dependencies as they are no longer needed.
- Removed `textract` and ~50 MB of dependencies as they are no longer needed.
- Python 3.12 support

## v3.1.1 (2024-02-11)
2 changes: 1 addition & 1 deletion Dockerfile
@@ -12,7 +12,7 @@ FROM base as pydeps
RUN pip install "poetry==1.7.1" && \
python -m venv ${VIRTUAL_ENV}
COPY pyproject.toml poetry.lock ./
RUN poetry install --only=main --extras=textract --no-interaction
RUN poetry install --only=main --no-interaction


FROM base as final
7 changes: 0 additions & 7 deletions README.md
@@ -77,13 +77,6 @@ Installation is done via pip. Note that the package name is `organize-tool`:
pip install -U organize-tool
```

If you want the text extraction capabilities, install with `textract` like this (the
qoutes are important):

```bash
pip install "organize-tool[texttract]"
```

This command can also be used to update to the newest version. Now you can run `organize --help` to check if the installation was successful.

### Create your first rule
10 changes: 5 additions & 5 deletions organize/filters/filecontent.py
@@ -3,7 +3,7 @@
import subprocess
from functools import lru_cache
from pathlib import Path
from typing import Any, ClassVar
from typing import Any, Callable, ClassVar, Dict

from pydantic.config import ConfigDict
from pydantic.dataclasses import dataclass
@@ -54,9 +54,9 @@ def _pdftotext_available() -> bool:

def _extract_with_pdftotext(path: Path, keep_layout: bool) -> str:
    if keep_layout:
        args = ("-layout", str(path), "-")
        args = ["-layout", str(path), "-"]
    else:
        args = (str(path), "-")
        args = [str(path), "-"]
    result = subprocess.check_output(("pdftotext", *args), text=True)
    return clean(result)
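
Review note: the hunk above changes `args` from a tuple to a list in both branches. The likely motivation is type checking, since two tuples of different lengths have different inferred types while both lists are `List[str]`; the commit itself does not state this. A minimal sketch of the same argument construction follows. The helper names `build_pdftotext_args` and `run_pdftotext` are hypothetical and do not appear in the repository.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pdftotext_args(path: Path, keep_layout: bool) -> List[str]:
    """Hypothetical helper: build the arguments that follow `pdftotext`."""
    args: List[str] = []
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        args.append("-layout")
    # The trailing "-" makes pdftotext write the text to stdout.
    args += [str(path), "-"]
    return args


def run_pdftotext(path: Path, keep_layout: bool = True) -> str:
    """Hypothetical wrapper: call the pdftotext binary and return its output."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    cmd = ["pdftotext", *build_pdftotext_args(path, keep_layout)]
    return subprocess.check_output(cmd, text=True)
```

Keeping the argument construction separate from the subprocess call means the branch logic can be checked without the binary installed.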

@@ -74,13 +74,13 @@ def extract_pdf(path: Path, keep_layout: bool = True) -> str:


def extract_docx(path: Path) -> str:
    import docx2txt
    import docx2txt # type: ignore

    result = docx2txt.process(path)
    return clean(result)


EXTRACTORS = {
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".md": extract_txt,
    ".txt": extract_txt,
    ".log": extract_txt,
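
Review note: the new annotation on `EXTRACTORS` declares a mapping from a file suffix to a function that takes a `Path` and returns `str`. The diff does not show how the table is consumed, so the following is only a sketch of one plausible lookup, assuming dispatch on the lowercased suffix. The function `extract_text` and the stand-in body of `extract_txt` are assumptions for illustration, not code from the repository.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_txt(path: Path) -> str:
    # Stand-in for the project's extractor; here it just reads the file.
    return path.read_text(encoding="utf-8")


# Same shape as the annotated table in the diff: suffix -> extractor.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".md": extract_txt,
    ".txt": extract_txt,
    ".log": extract_txt,
}


def extract_text(path: Path) -> str:
    """Hypothetical dispatcher: pick an extractor by file suffix."""
    suffix = path.suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}") from None
    return extractor(path)
```

With the explicit annotation, a type checker can flag an entry whose signature does not accept a single `Path`, which an unannotated dict literal would not necessarily catch.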
