Snippet: concatenating various dicts #4

madig · 2023-10-09T12:41:37Z

For example, when you want to build one big dictionary covering all Latin-script languages from the data in git://anongit.freedesktop.org/libreoffice/dictionaries:

from pathlib import Path
from fontTools.unicodedata import script

ACCEPTED_SCRIPTS = {"Zyyy", "Zzzz", "Zinh", "Latn"}

dic_data = []
for p in Path("../dictionaries").glob("**/*.aff"):
    aff = p.read_bytes()
    encoding = None
    for line in aff.splitlines():
        if line.startswith(b"SET"):
            # split() with no argument handles tabs and repeated spaces
            encoding = line.split()[1]
            break
    if encoding is None:
        print("Can't find encoding for", p, ", assuming utf-8")
        encoding = "utf-8"
    else:
        encoding = encoding.decode("ascii")
    print("Reading", p, "with encoding", encoding)
    try_chars = None
    for line in aff.splitlines():
        if line.startswith(b"TRY"):
            # split() with no argument handles tabs and repeated spaces
            try_chars = line.split()[1]
            break
    if try_chars is None:
        print("Can't find TRY for", p)
        continue
    try_chars = try_chars.decode(encoding)
    if any(script(c) not in ACCEPTED_SCRIPTS for c in try_chars):
        print("Not Latin, skipping", p)
        continue

    dic = p.with_suffix(".dic").read_text(encoding=encoding).splitlines()
    del dic[0]  # Remove "number of entries" line
    dic_data.extend(dic)

# Write UTF-8 explicitly; the platform default may not cover every character
Path("all.dic").write_text("\n".join(dic_data), encoding="utf-8")

Note: some .aff files, such as the Hungarian one, start in an ISO encoding and then switch to UTF-8, so you have to read the files as bytes rather than as text.
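
To illustrate the mixed-encoding problem from the note above, here is a minimal sketch that decodes an .aff file line by line. The helper names `declared_encoding` and `decode_lines` are hypothetical, and the "try strict UTF-8 first, then fall back to the declared encoding" heuristic is an assumption of this sketch, not something the snippet above does. It is needed because a single-byte codec such as ISO8859-2 accepts almost any byte sequence and so never signals that a line is really UTF-8.

```python
def declared_encoding(aff: bytes, default: str = "utf-8") -> str:
    """Return the encoding named on the first SET line, or `default`."""
    for line in aff.splitlines():
        fields = line.split()  # splits on tabs and repeated spaces alike
        if len(fields) >= 2 and fields[0] == b"SET":
            return fields[1].decode("ascii")
    return default


def decode_lines(aff: bytes) -> list[str]:
    """Decode each line as strict UTF-8, else with the declared encoding.

    Assumption: a line that is valid UTF-8 is treated as UTF-8. Pure-ASCII
    lines decode identically under either choice.
    """
    encoding = declared_encoding(aff)
    lines = []
    for raw in aff.splitlines():
        try:
            lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            lines.append(raw.decode(encoding, errors="replace"))
    return lines


# Made-up sample: a Latin-2 TRY line followed by a UTF-8 comment line
sample = b"SET ISO8859-2\nTRY \xe1\xe9\n# \xc3\xa1\n"
decoded = decode_lines(sample)
```

This only shows why the file has to be opened with `read_bytes()`: once the bytes are in hand, each line can be decoded on its own terms.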

behdad · 2023-10-09T16:51:42Z

Oh thank you!

behdad · 2023-10-17T22:55:31Z

@madig Any chance you can somehow integrate this into the CLI? If not, I'll do it. Just thought to ask. :)

madig · 2023-10-27T11:36:38Z

Uhm, eventually, but I'm currently busy with other stuff...

behdad · 2023-11-15T18:26:49Z

I incorporated some of this into the code now:

halfkern/ngrams.py, line 75 in ed6d2df:

# Assume hunspell dictionary format;
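
For readers unfamiliar with the hunspell .dic layout that the referenced line assumes, here is a rough sketch of extracting bare words from such a file. This is not the code in halfkern/ngrams.py; the function name `words_from_dic` is hypothetical, and the parsing is deliberately simplified: it drops a numeric first line (the entry count), cuts each entry at the first tab (morphological fields) and at the first "/" (affix flags), and ignores escaped slashes and space-separated morphological fields.

```python
def words_from_dic(text: str) -> list[str]:
    """Return the bare words from hunspell-style .dic text (simplified)."""
    lines = text.splitlines()
    # The first line is conventionally the number of entries; drop it
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    words = []
    for line in lines:
        entry = line.split("\t", 1)[0]         # drop morphological fields
        word = entry.split("/", 1)[0].strip()  # drop affix flags
        if word:
            words.append(word)
    return words


# Made-up sample entries
sample_dic = "3\nhello/S\nworld\nfoo/AB\tpo:noun\n"
sample_words = words_from_dic(sample_dic)
```

Stripping the flags matters if the concatenated output is used as a plain word list, since the snippet at the top of this issue keeps each .dic line verbatim.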

behdad added three commits that referenced this issue on Nov 10, 2023, each titled "Load encoding from libreoffice dictionaries" and based on #4:

9d1e9bf
f946ae3
7bfd03c
