From f74b1ba6d1522455e623814b98967ee930f67bbf Mon Sep 17 00:00:00 2001
From: Matt Miller
Date: Mon, 1 Jul 2024 09:49:42 -0400
Subject: [PATCH 1/2] The list (#114)

* Supported script documentation.

* Fix links, not that they work right now.

* Update README.md

* fixed .yml extention in links

* Italicization.

---------

Co-authored-by: Kevin Ford
---
 README.md                |   2 +
 doc/supported_scripts.md | 109 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)
 create mode 100644 doc/supported_scripts.md

diff --git a/README.md b/README.md
index b1176b0..23c82dd 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@

 REST API service to convert non-Latin scripts to Latin, and vice versa.

+[View supported scripts](/doc/supported_scripts.md).
+
 ## Environment variables

 The provided `example.env` can be renamed to `.env` in your deployment and/or
diff --git a/doc/supported_scripts.md b/doc/supported_scripts.md
new file mode 100644
index 0000000..f81f5e0
--- /dev/null
+++ b/doc/supported_scripts.md
@@ -0,0 +1,109 @@
+## Supported Scripts/Mappings in ScriptShifter
+
+Below are the supported scripts, and supported directionality of those scripts, in ScriptShifter.
+
+The "status" value may be *stable*, *beta*, *alpha*, or blank (i.e. empty). "Stable" means the mapping
+is maintained within ScriptShifter, has been tested, and is in use. "Beta" or "alpha" represent mappings
+that are in some form of development and/or testing, with "beta" being more mature than "alpha" in this
+regard. If the column is 'blank,' transliteration of the script is available in ScriptShifter from a
+third-party library.
+
+
+| Mapping file | Script Name | Roman-to-script | Script-to-roman | Status | Remarks
+| -------- | ------- | ------- | ------- | ------- | ------- |
+| [abkhaz_cyrillic](../scriptshifter/tables/data/abkhaz_cyrillic.yml) | Abkhaz (Cyrillic) | Y | Y | stable |
+| [altai_cyrillic](../scriptshifter/tables/data/altai_cyrillic.yml) | Altai (Cyrillic) | Y | Y | stable |
+| [arabic](../scriptshifter/tables/data/arabic.yml) | Arabic (S2R) | N | Y | stable |
+| [armenian](../scriptshifter/tables/data/armenian.yml) | Armenian | Y | Y | stable |
+| [asian_cyrillic](../scriptshifter/tables/data/asian_cyrillic.yml) | Asian Cyrillic | Y | Y | stable |
+| [azerbaijani_cyrillic](../scriptshifter/tables/data/azerbaijani_cyrillic.yml) | Azerbaijani (Cyrillic) | Y | Y | stable |
+| [bashkir_cyrillic](../scriptshifter/tables/data/bashkir_cyrillic.yml) | Bashkir (Cyrillic) | Y | Y | stable |
+| [belarusian](../scriptshifter/tables/data/belarusian.yml) | Belarusian | Y | Y | stable |
+| [bengali](../scriptshifter/tables/data/bengali.yml) | Bengali | Y | Y | |
+| [bulgarian](../scriptshifter/tables/data/bulgarian.yml) | Bulgarian | Y | Y | stable |
+| [buriat](../scriptshifter/tables/data/buriat.yml) | Buriat (Cyrillic) | Y | Y | stable |
+| [burmese](../scriptshifter/tables/data/burmese.yml) | Burmese (Myanmar) | Y | Y | |
+| [chinese](../scriptshifter/tables/data/chinese.yml) | Chinese (Hanzi) | N | Y | stable |
+| [chukchi_cyrillic](../scriptshifter/tables/data/chukchi_cyrillic.yml) | Chukchi (Cyrillic) | Y | Y | stable |
+| [church_slavonic](../scriptshifter/tables/data/church_slavonic.yml) | Church Slavonic | Y | Y | stable |
+| [chuvash_cyrillic](../scriptshifter/tables/data/chuvash_cyrillic.yml) | Chuvash (Cyrillic) | Y | Y | stable |
+| [devanagari](../scriptshifter/tables/data/devanagari.yml) | Devanagari | Y | Y | |
+| [divehi_thaana](../scriptshifter/tables/data/divehi_thaana.yml) | Divehi (Thaana) | Y | Y | stable |
+| [dogri_devanagari](../scriptshifter/tables/data/dogri_devanagari.yml) | Dogri (Devanagari) | Y | Y | |
+| [dungan_cyrillic](../scriptshifter/tables/data/dungan_cyrillic.yml) | Dungan (Cyrillic) | Y | Y | stable |
+| [ethiopic](../scriptshifter/tables/data/ethiopic.yml) | Ethiopic (Amharic) | Y | Y | beta |
+| [even-evenki_cyrillic](../scriptshifter/tables/data/even.yml) | Even/Evenki (Cyrillic) | Y | Y | stable |
+| [gagauz_cyrillic](../scriptshifter/tables/data/gagauz_cyrillic.yml) | Gagauz (Cyrillic) | Y | Y | stable |
+| [georgian](../scriptshifter/tables/data/georgian.yml) | Georgian | Y | Y | stable |
+| [greek_classical](../scriptshifter/tables/data/greek_classical.yml) | Greek (classical) | Y | Y | stable |
+| [greek_modern](../scriptshifter/tables/data/greek_modern.yml) | Greek (modern) | Y | Y | stable |
+| [gujarati](../scriptshifter/tables/data/gujarati.yml) | Gujarati | Y | Y | | s-to-r lacks capitalization
+| [gurmukhi](../scriptshifter/tables/data/gurmukhi.yml) | Punjabi (Gurmukhi) | Y | Y | |
+| [hebrew](../scriptshifter/tables/data/hebrew.yml) | Hebrew | N | Y | |
+| [hindi](../scriptshifter/tables/data/hindi.yml) | Hindi (Devanagari) | Y | Y | beta |
+| [hiragana](../scriptshifter/tables/data/hiragana.yml) | Japanese (Hiragana) | Y | Y | |
+| [kalmyk_cyrillic](../scriptshifter/tables/data/kalmyk_cyrillic.yml) | Kalmyk (Cyrillic) | Y | Y | stable |
+| [kannada](../scriptshifter/tables/data/kannada.yml) | Kannada | Y | Y | | s-to-r lacks capitalization
+| [kara-kalpak_cyrillic](../scriptshifter/tables/data/kara.yml) | Kara-Kalpak (Cyrillic) | Y | Y | stable |
+| [karachai-balkar_cyrillic](../scriptshifter/tables/data/karachai.yml) | Karachay-Balkar (Cyrillic) | Y | Y | stable |
+| [karelian_cyrillic](../scriptshifter/tables/data/karelian_cyrillic.yml) | Karelian (Cyrillic) | Y | Y | stable |
+| [katakana](../scriptshifter/tables/data/katakana.yml) | Japanese (Katakana) | Y | Y | |
+| [kazakh_cyrillic](../scriptshifter/tables/data/kazakh_cyrillic.yml) | Kazakh (Cyrillic) | Y | Y | stable |
+| [khakass_cyrillic](../scriptshifter/tables/data/khakass_cyrillic.yml) | Khakass (Cyrillic) | Y | Y | stable |
+| [khanty_cyrillic](../scriptshifter/tables/data/khanty_cyrillic.yml) | Khanty (Cyrillic) | Y | Y | stable |
+| [khmer](../scriptshifter/tables/data/khmer.yml) | Khmer | Y | Y | |
+| [komi_cyrillic](../scriptshifter/tables/data/komi_cyrillic.yml) | Komi (Cyrillic) | Y | Y | stable |
+| [korean_names](../scriptshifter/tables/data/korean_names.yml) | Korean (last + first names only) | N | Y | |
+| [korean_nonames](../scriptshifter/tables/data/korean_nonames.yml) | Korean | N | Y | |
+| [korean_old](../scriptshifter/tables/data/korean_old.yml) | Korean | x | x | |
+| [koryak_cyrillic](../scriptshifter/tables/data/koryak_cyrillic.yml) | Koryak (Cyrillic) | Y | Y | stable |
+| [kurdish](../scriptshifter/tables/data/kurdish.yml) | Kurdish | Y | N | stable |
+| [kyrgyz_cyrillic](../scriptshifter/tables/data/kyrgyz_cyrillic.yml) | Kyrgyz (Cyrillic) | Y | Y | stable |
+| [lithuanian_cyrillic](../scriptshifter/tables/data/lithuanian_cyrillic.yml) | Lithuanian (Cyrillic) | Y | Y | stable |
+| [macedonian](../scriptshifter/tables/data/macedonian.yml) | Macedonian | Y | Y | stable |
+| [malayalam](../scriptshifter/tables/data/malayalam.yml) | Malayalam | Y | Y | | s-to-r lacks capitalization
+| [mansi_cyrillic](../scriptshifter/tables/data/mansi_cyrillic.yml) | Mansi (Cyrillic) | Y | Y | stable |
+| [marathi](../scriptshifter/tables/data/marathi.yml) | Marathi | Y | Y | | s-to-r lacks capitalization
+| [mari_cyrillic](../scriptshifter/tables/data/mari_cyrillic.yml) | Mari (Cyrillic) | Y | Y | stable |
+| [moldovan_cyrillic](../scriptshifter/tables/data/moldovan_cyrillic.yml) | Moldovan (Cyrillic) | Y | Y | stable |
+| [mongolian_cyrillic](../scriptshifter/tables/data/mongolian_cyrillic.yml) | Mongolian (Cyrillic) | Y | Y | stable |
+| [mongolian_mongol_bichig](../scriptshifter/tables/data/mongolian_mongol_bichig.yml) | Mongolian (Mongol bichig) | Y | Y | stable |
+| [mordvin_cyrillic](../scriptshifter/tables/data/mordvin_cyrillic.yml) | Mordvin (Cyrillic) | Y | Y | stable |
+| [nenets_cyrillic](../scriptshifter/tables/data/nenets_cyrillic.yml) | Nenets (Cyrillic) | Y | Y | stable |
+| [nepali_devanagari](../scriptshifter/tables/data/nepali_devanagari.yml) | Nepali (Devanagari) | Y | Y | |
+| [newari_devanagari](../scriptshifter/tables/data/newari_devanagari.yml) | Newari (Devanagari) | Y | Y | |
+| [oriya](../scriptshifter/tables/data/oriya.yml) | Oriya | Y | Y | | s-to-r lacks capitalization
+| [ossetic_cyrillic](../scriptshifter/tables/data/ossetic_cyrillic.yml) | Ossetic (Cyrillic) | Y | Y | stable |
+| [pali](../scriptshifter/tables/data/pali.yml) | Pali | Y | Y | |
+| [panjabi](../scriptshifter/tables/data/panjabi.yml) | Punjabi (Gurmukhi) | Y | Y | | s-to-r lacks capitalization
+| [persian](../scriptshifter/tables/data/persian.yml) | Persian | Y | N | stable |
+| [prakrit_devanagari](../scriptshifter/tables/data/prakrit_devanagari.yml) | Prakrit (Devanagari) | Y | Y | |
+| [pulaar](../scriptshifter/tables/data/pulaar.yml) | Pulaar (Adlam) | Y | Y | |
+| [pushto](../scriptshifter/tables/data/pushto.yml) | Pushto | Y | N | stable |
+| [rajasthani_devanagari](../scriptshifter/tables/data/rajasthani_devanagari.yml) | Rajasthani (Devanagari) | Y | Y | |
+| [romani_cyrillic](../scriptshifter/tables/data/romani_cyrillic.yml) | Romani (Cyrillic) | Y | Y | stable |
+| [russian](../scriptshifter/tables/data/russian.yml) | Russian | Y | Y | stable |
+| [sanskrit_devanagari](../scriptshifter/tables/data/sanskrit_devanagari.yml) | Sanskrit (Devanagari) | Y | Y | | s-to-r lacks capitalization
+| [serbian](../scriptshifter/tables/data/serbian.yml) | Serbian | Y | Y | stable |
+| [shor_cyrillic](../scriptshifter/tables/data/shor_cyrillic.yml) | Shor (Cyrillic) | Y | Y | stable |
+| [sinhalese_sinhala](../scriptshifter/tables/data/sinhalese_sinhala.yml) | Sinhalese (Sinhala) | Y | Y | | s-to-r lacks capitalization
+| [syriac_cyrillic](../scriptshifter/tables/data/syriac_cyrillic.yml) | Syriac (Cyrillic) | Y | Y | stable |
+| [tajik_cyrillic](../scriptshifter/tables/data/tajik_cyrillic.yml) | Tajik (Cyrillic) | Y | Y | stable |
+| [tamil](../scriptshifter/tables/data/tamil.yml) | Tamil | Y | Y | beta |
+| [tamil_brahmi](../scriptshifter/tables/data/tamil_brahmi.yml) | Tamil Brahmi | Y | Y | |
+| [tamil_extended](../scriptshifter/tables/data/tamil_extended.yml) | Tamil (extended) | Y | Y | |
+| [tatar-kryashen_cyrillic](../scriptshifter/tables/data/tatar.yml) | Tatar-Kryashen (Cyrillic) | Y | Y | stable |
+| [tatar_cyrillic](../scriptshifter/tables/data/tatar_cyrillic.yml) | Tatar (Cyrillic) | Y | Y | stable |
+| [telugu](../scriptshifter/tables/data/telugu.yml) | Telugu | Y | Y | | s-to-r lacks capitalization
+| [thai](../scriptshifter/tables/data/thai.yml) | Thai | Y | Y | |
+| [tibetan](../scriptshifter/tables/data/tibetan.yml) | Tibetan | Y | Y | |
+| [turkmen_cyrillic](../scriptshifter/tables/data/turkmen_cyrillic.yml) | Turkmen (Cyrillic) | Y | Y | stable |
+| [tuvinian_cyrillic](../scriptshifter/tables/data/tuvinian_cyrillic.yml) | Tuvinian (Cyrillic) | Y | Y | stable |
+| [udmurt_cyrillic](../scriptshifter/tables/data/udmurt_cyrillic.yml) | Udmurt (Cyrillic) | Y | Y | stable |
+| [uighur_cyrillic](../scriptshifter/tables/data/uighur_cyrillic.yml) | Uighur (Cyrillic) | Y | Y | stable |
+| [ukrainian](../scriptshifter/tables/data/ukrainian.yml) | Ukrainian | Y | Y | stable |
+| [urdu](../scriptshifter/tables/data/urdu.yml) | Urdu | Y | N | stable |
+| [uzbek_cyrillic](../scriptshifter/tables/data/uzbek_cyrillic.yml) | Uzbek (Cyrillic) | Y | Y | stable |
+| [yakut_cyrillic](../scriptshifter/tables/data/yakut_cyrillic.yml) | Yakut (Cyrillic) | Y | Y | stable |
+| [yiddish](../scriptshifter/tables/data/yiddish.yml) | Yiddish | Y | Y | |
+| [yuit_cyrillic](../scriptshifter/tables/data/yuit_cyrillic.yml) | Yuit (Cyrillic) | Y | Y | stable |

From 7f1c33f8ef3f9d39b935908afa0514d0daa78462 Mon Sep 17 00:00:00 2001
From: scossu
Date: Sun, 7 Jul 2024 20:43:17 -0400
Subject: [PATCH 2/2] Fix Greek numerals logic; add test strings.

---
 scriptshifter/hooks/greek/__init__.py         | 70 ++++++++++++++-----
 scriptshifter/tables/data/greek_classical.yml |  2 +
 tests/data/script_samples/greek.csv           |  5 +-
 3 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/scriptshifter/hooks/greek/__init__.py b/scriptshifter/hooks/greek/__init__.py
index b0c0b10..f098375 100644
--- a/scriptshifter/hooks/greek/__init__.py
+++ b/scriptshifter/hooks/greek/__init__.py
@@ -77,12 +77,17 @@ def parse_numeral(ctx):
     characters mixed with letter characters without a space. Therefore,
     "͵ακακαα" would transliterate "1021kaa", and "͵αακαα", "1001kaa".
     """
-    # Parse thousands.
+    # Parse ≥1000.
     if ctx.src[ctx.cur] == THOUSANDS_PREFIX:
         tk = ctx.src[ctx.cur + 1]

         try:
-            ctx.dest_ls.append(str(DIGITS[4][tk]))
+            # Exception for 2-letter digit.
+            if ctx.src[ctx.cur + 1: ctx.cur + 3] == "στ":
+                ctx.dest_ls.append(str(DIGITS[4]["στ"]))
+                ctx.cur += 1
+            else:
+                ctx.dest_ls.append(str(DIGITS[4][tk]))
             ctx.cur += 2

         except KeyError:
@@ -104,8 +109,13 @@ def parse_numeral(ctx):
                 ext[ext_cur] = str(DIGITS[3 - i][ctx.src[ctx.cur]])
                 ctx.cur += 1
             except KeyError:
-                # If the number char is not in the correct position, pad with 0
-                continue
+                # Exception for 2-letter digit.
+                if i == 2 and ctx.src[ctx.cur: ctx.cur + 2] == "στ":
+                    ext[ext_cur] = "6"
+                    ctx.cur += 2
+                else:
+                    # If the char is not in the correct position, pad with 0.
+                    continue
             finally:
                 ext_cur += 1
         ctx.dest_ls.extend(ext)
@@ -119,23 +129,51 @@ def parse_numeral(ctx):
     # transliterated characters.
     if ctx.src[ctx.cur] == NUM_SUFFIX:
         # Move back up to 3 positions.
-        for i in range(1, 4):
-            cur = ctx.cur - i
+        offset = 0  # Added offset if στ is found.
+        parsed = 0  # Parsed numeral to replace the alpha characters.
+        breakout = False  # Break out of i loop.
+
+        i = 1  # Current position in the numeral. 1 = units, 2 = tens, etc.
+        mark_pos = ctx.cur  # Mark this position to resume parsing later.
+        while i < 4:
+            if breakout:
+                break
+            cur = ctx.cur - i - offset
             if cur >= 0:
                 num_tk = ctx.src[cur]  # Number to be parsed
-                if ctx.src[cur] in DIGITS[i]:
-                    # Not yet reached word boundary.
-                    ctx.dest_ls[-i] = str(DIGITS[i][num_tk])
-                else:
-                    if ctx.src[cur] != " ":  # Word boundary.
-                        # Something's wrong.
+                # Exception for στ. Scan one character farther left.
+                if ctx.src[cur - 1:cur + 1] == "στ":
+                    num_tk = "στ"
+                    offset = 1
+                for j in range(i, 4):
+                    i = j
+                    if num_tk in DIGITS[j]:
+                        # Not yet reached word boundary.
+                        parsed += DIGITS[j][num_tk] * 10 ** (j - 1)
+                        break
+
+                    if num_tk == " " or cur == 0:  # Word boundary.
+                        breakout = True
+                        break
+
+                    # If we got here we tried all positions without finding a
+                    # match. Something's wrong.
+                    if j == 3:
+                        # continue
                         ctx.warnings.append(
-                            f"Character `{ctx.src[cur] }` at position "
+                            f"Character `{num_tk}` at position "
                             f"{cur} is not a valid digit character "
                             f"at place #{4 - i} in a numeral.")
-                        ctx.cur += 1
-                        return CONT  # Continue normal parsing.
+                        # ctx.cur += 1 + offset
+                        # return CONT  # Continue normal parsing.
+            i += 1
+
+        if parsed > 0:
+            ctx.dest_ls = (
+                ctx.dest_ls[:mark_pos - len(str(parsed)) - offset]
+                + [str(parsed)])
+
+        ctx.cur = mark_pos + 1  # Skip past numeral suffix.

-        ctx.cur += 1

         return CONT
diff --git a/scriptshifter/tables/data/greek_classical.yml b/scriptshifter/tables/data/greek_classical.yml
index 68a8be4..15507f2 100644
--- a/scriptshifter/tables/data/greek_classical.yml
+++ b/scriptshifter/tables/data/greek_classical.yml
@@ -344,6 +344,7 @@ script_to_roman:
     "\u037C": "(."
     "\u037D": ".)"
     "\u037E": "?\u0333"
+    ";": "?"
     "\u037F": "J"
     # \u0380 reserved
     # \u0381 reserved
@@ -594,6 +595,7 @@ script_to_roman:
     ".)\u0333": "\u03FF"
     ".)": "\u037D"
     "?\u0333": "\u037E"
+    "?": "\u037E"
     "\"\u0332": "\u201C"
     "\"\u0333": "\u201D"
     "'\u0332": "\u2018"
diff --git a/tests/data/script_samples/greek.csv b/tests/data/script_samples/greek.csv
index e5223de..e189653 100644
--- a/tests/data/script_samples/greek.csv
+++ b/tests/data/script_samples/greek.csv
@@ -10,7 +10,7 @@ greek_classical,ἀΰπνους νύκτας ἴαυον,aypnous nyktas iauon,,
 greek_classical,Λητοῦς καὶ Διὸς υἱός,Lētous kai Dios huios,,
 greek_classical,ὑϊκὸν πάσχειν,hyikon paschein,,
 greek_classical,εἶπε πρὸς τὸν ἄνδρα τὸν ἑωυτῆς,eipe pros ton andra ton heōutēs,,
-greek_classical,τί τοῦδ’ ἂν εὕρημ’ ηὗρον εὐτυχέστερον;,ti toud’ an heurēm’ hēuron eutychesteron,,
+greek_classical,τί τοῦδ’ ἂν εὕρημ’ ηὗρον εὐτυχέστερον;,ti toud’ an heurēm’ hēuron eutychesteron?,,
 greek_classical,Τοῦ Κατὰ πασῶν αἱρέσεων ἐλέγχου βιβλίον αʹ,Tou Kata pasōn haireseōn elenchou biblion 1,,
 greek_classical,καλὸν κἀγαθόν,kalon kagathon,,
 greek_classical,ᾤχοντο θοἰμάτιον λαβόντες μου,ōchonto thoimation labontes mou,,
@@ -21,6 +21,9 @@ greek_classical,ἄλαϲτα δὲ ϝέργα πάθον κακὰ μηϲαμέ
 greek_classical,Δαμαρέτα τ’ ἐρατά τε Ϝιανθεμίϲ,Damareta t’ erata te Wianthemis,,
 greek_classical,ξένϝος,xenwos,,
 greek_classical,Πάτροϙλος,Patroḳlos,,
+greek_classical,"λβʹ. Ἐπεὶ δὲ ἡ τύχη κράτιστον ἐπὶ πάντα τὰ ἀνθρώπεια, μηδὲ Ἡλιόδωρος ἀπαξιούσθω σοφιστῶν κύκλου παράδοξον ἀγώνισμα τύχης γενόμενος·","32. Epei de ē tychi kratiston epi panta ta anthrōpeia, mide Hēliodōros apaxiousthō sophistōn kyklou paradoxon agōnisma tychis genomenos",,
+greek_classical,"κζʹ. Μὴ δεύτερα τῶν προειρημένων σοφιστῶν μηδὲ Ἱππόδρομόν τις ἡγείσθω τὸν Θετταλόν, τῶν μὲν γὰρ βελτίων φαίνεται, τῶν δὲ οὐκ οἶδα ὅ τι λείπεται","27. Mē deutera tōn proeirēmenōn sophistōn mide Ippodromon tis ēgeisthō ton Thettalon, tōn men gar beltiōn phainetai, tōn de ouk oida o ti leipetai",,
+greek_classical,"ιγʹ. Πῶλον δὲ τὸν Ἀκραγαντῖνον Γοργίας σοφιστὴν ἐξεμελέτησε πολλῶν, ὥς φασι, χρημάτων, καὶ γὰρ δὴ καὶ τῶν πλουτούντων ὁ Πῶλος.","13. Pōlon de ton Akragantinon Gorgias sophistēn exemeletēse pollōn, ōs phasi, chrēmatōn, kai gar dē kai tōn ploutountōn o Pōlos",,
 greek_modern,"Ἐτήσια ἔκθεσις / Κυπριακὴ Δημοκρατία, Ὑπουργεῖον Ἐργασίας καὶ Κοινωνικῶν Ἀσφαλίσεων","Etēsia ekthesis / Kypriakē Dēmokratia, Hypourgeion Ergasias kai Koinōnikōn Asphaliseōn",,
 greek_modern,"Ετήσια έκθεση / Κυπριακή Δημοκρατία, Υπουργείο Εργασίας και Κοινωνικών Ασφαλίσεων","Etēsia ekthesē / Kypriakē Dēmokratia, Hypourgeio Ergasias kai Koinōnikōn Asphaliseōn",,
 greek_modern,Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής Πολιτικής,Hellēniko Hidryma Eurōpaikēs kai Exōterikēs Politikēs,,