Snippet: concatenating various dicts #4

Open
madig opened this issue Oct 9, 2023 · 4 comments
Comments

@madig

madig commented Oct 9, 2023

For when you want to build one big dictionary covering all Latin-script languages from the data in git://anongit.freedesktop.org/libreoffice/dictionaries:

from pathlib import Path
from fontTools.unicodedata import script

ACCEPTED_SCRIPTS = {"Zyyy", "Zzzz", "Zinh", "Latn"}

dic_data = []
for p in Path("../dictionaries").glob("**/*.aff"):
    aff = p.read_bytes()
    encoding = None
    for line in aff.splitlines():
        if line.startswith(b"SET"):
            encoding = line.split()[1]  # split on any run of spaces/tabs
            break
    if encoding is None:
        print("Can't find encoding for", p, ", assuming utf-8")
        encoding = "utf-8"
    else:
        encoding = encoding.decode("ascii")
    print("Reading", p, "with encoding", encoding)
    try_chars = None
    for line in aff.splitlines():
        if line.startswith(b"TRY"):
            try_chars = line.split()[1]  # split on any run of spaces/tabs
            break
    if try_chars is None:
        print("Can't find TRY for", p)
        continue
    try_chars = try_chars.decode(encoding)
    if any(script(c) not in ACCEPTED_SCRIPTS for c in try_chars):
        print("Not Latin, skipping", p)
        continue

    dic = p.with_suffix(".dic").read_text(encoding=encoding).splitlines()
    del dic[0]  # Remove "number of entries" line
    dic_data.extend(dic)

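# Write UTF-8 explicitly; the platform default encoding may not cover every character.
Path("all.dic").write_text("\n".join(dic_data), encoding="utf-8")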

Note: some .aff files like Hungarian start with some ISO encoding and then switch to UTF-8, so you'll have to read the files in as bytes.
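
To illustrate why reading bytes works here: the SET and TRY lines can be located before the encoding is known, because the directive names themselves are ASCII. Below is a minimal sketch of that lookup, not the code that ended up in the project. `aff_directive` and the `sample` data are hypothetical names made up for this example, and unlike the snippet above it matches the directive as a whole field rather than with `startswith`.

```python
from typing import Optional


def aff_directive(aff: bytes, name: bytes) -> Optional[bytes]:
    """Return the first argument of the first `name` line in raw .aff bytes, or None."""
    for line in aff.splitlines():
        fields = line.split()  # splits on any run of ASCII whitespace
        if len(fields) >= 2 and fields[0] == name:
            return fields[1]
    return None


# Hypothetical sample: an ASCII header line followed by a line that is not valid UTF-8.
sample = b"SET ISO8859-2\nTRY \xe1bc\n"
encoding = aff_directive(sample, b"SET").decode("ascii")
try_chars = aff_directive(sample, b"TRY").decode(encoding)
```

Only the TRY value is decoded with the declared encoding; the rest of the file stays as bytes, so a later switch of encoding inside the .aff file does not raise an error at this stage.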

@behdad
Owner

behdad commented Oct 9, 2023

Oh thank you!

@behdad
Owner

behdad commented Oct 17, 2023

@madig Any chance you can somehow integrate this into the CLI? If not, I'll do it. Just thought to ask. :)

@madig
Author

madig commented Oct 27, 2023

Uhm, eventually, but I'm currently busy with other stuff...

behdad added three commits that referenced this issue Nov 10, 2023
@behdad
Owner

behdad commented Nov 15, 2023

I incorporated some of this into the code now:

# Assume hunspell dictionary format;
