Merge branch 'greek_numerals' into test
scossu committed Jul 9, 2024
2 parents 157668b + 7f1c33f commit 5bb20bf
Showing 5 changed files with 171 additions and 17 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -2,6 +2,8 @@

REST API service to convert non-Latin scripts to Latin, and vice versa.

[View supported scripts](/doc/supported_scripts.md).

## Environment variables

The provided `example.env` can be renamed to `.env` in your deployment and/or
109 changes: 109 additions & 0 deletions doc/supported_scripts.md
@@ -0,0 +1,109 @@
## Supported Scripts/Mappings in ScriptShifter

Below are the supported scripts, and supported directionality of those scripts, in ScriptShifter.

The "status" value may be *stable*, *beta*, *alpha*, or blank (i.e. empty). "Stable" means the mapping
is maintained within ScriptShifter, has been tested, and is in use. "Beta" or "alpha" represent mappings
that are in some form of development and/or testing, with "beta" being more mature than "alpha" in this
regard. If the column is blank, transliteration of the script is provided in ScriptShifter by a
third-party library.


| Mapping file | Script Name | Roman-to-script | Script-to-roman | Status | Remarks |
| -------- | ------- | ------- | ------- | ------- | ------- |
| [abkhaz_cyrillic](../scriptshifter/tables/data/abkhaz_cyrillic.yml) | Abkhaz (Cyrillic) | Y | Y | stable |
| [altai_cyrillic](../scriptshifter/tables/data/altai_cyrillic.yml) | Altai (Cyrillic) | Y | Y | stable |
| [arabic](../scriptshifter/tables/data/arabic.yml) | Arabic (S2R) | N | Y | stable |
| [armenian](../scriptshifter/tables/data/armenian.yml) | Armenian | Y | Y | stable |
| [asian_cyrillic](../scriptshifter/tables/data/asian_cyrillic.yml) | Asian Cyrillic | Y | Y | stable |
| [azerbaijani_cyrillic](../scriptshifter/tables/data/azerbaijani_cyrillic.yml) | Azerbaijani (Cyrillic) | Y | Y | stable |
| [bashkir_cyrillic](../scriptshifter/tables/data/bashkir_cyrillic.yml) | Bashkir (Cyrillic) | Y | Y | stable |
| [belarusian](../scriptshifter/tables/data/belarusian.yml) | Belarusian | Y | Y | stable |
| [bengali](../scriptshifter/tables/data/bengali.yml) | Bengali | Y | Y | |
| [bulgarian](../scriptshifter/tables/data/bulgarian.yml) | Bulgarian | Y | Y | stable |
| [buriat](../scriptshifter/tables/data/buriat.yml) | Buriat (Cyrillic) | Y | Y | stable |
| [burmese](../scriptshifter/tables/data/burmese.yml) | Burmese (Myanmar) | Y | Y | |
| [chinese](../scriptshifter/tables/data/chinese.yml) | Chinese (Hanzi) | N | Y | stable |
| [chukchi_cyrillic](../scriptshifter/tables/data/chukchi_cyrillic.yml) | Chukchi (Cyrillic) | Y | Y | stable |
| [church_slavonic](../scriptshifter/tables/data/church_slavonic.yml) | Church Slavonic | Y | Y | stable |
| [chuvash_cyrillic](../scriptshifter/tables/data/chuvash_cyrillic.yml) | Chuvash (Cyrillic) | Y | Y | stable |
| [devanagari](../scriptshifter/tables/data/devanagari.yml) | Devanagari | Y | Y | |
| [divehi_thaana](../scriptshifter/tables/data/divehi_thaana.yml) | Divehi (Thaana) | Y | Y | stable |
| [dogri_devanagari](../scriptshifter/tables/data/dogri_devanagari.yml) | Dogri (Devanagari) | Y | Y | |
| [dungan_cyrillic](../scriptshifter/tables/data/dungan_cyrillic.yml) | Dungan (Cyrillic) | Y | Y | stable |
| [ethiopic](../scriptshifter/tables/data/ethiopic.yml) | Ethiopic (Amharic) | Y | Y | beta |
| [even-evenki_cyrillic](../scriptshifter/tables/data/even.yml) | Even/Evenki (Cyrillic) | Y | Y | stable |
| [gagauz_cyrillic](../scriptshifter/tables/data/gagauz_cyrillic.yml) | Gagauz (Cyrillic) | Y | Y | stable |
| [georgian](../scriptshifter/tables/data/georgian.yml) | Georgian | Y | Y | stable |
| [greek_classical](../scriptshifter/tables/data/greek_classical.yml) | Greek (classical) | Y | Y | stable |
| [greek_modern](../scriptshifter/tables/data/greek_modern.yml) | Greek (modern) | Y | Y | stable |
| [gujarati](../scriptshifter/tables/data/gujarati.yml) | Gujarati | Y | Y | | s-to-r lacks capitalization
| [gurmukhi](../scriptshifter/tables/data/gurmukhi.yml) | Punjabi (Gurmukhi) | Y | Y | |
| [hebrew](../scriptshifter/tables/data/hebrew.yml) | Hebrew | N | Y | |
| [hindi](../scriptshifter/tables/data/hindi.yml) | Hindi (Devanagari) | Y | Y | beta |
| [hiragana](../scriptshifter/tables/data/hiragana.yml) | Japanese (Hiragana) | Y | Y | |
| [kalmyk_cyrillic](../scriptshifter/tables/data/kalmyk_cyrillic.yml) | Kalmyk (Cyrillic) | Y | Y | stable |
| [kannada](../scriptshifter/tables/data/kannada.yml) | Kannada | Y | Y | | s-to-r lacks capitalization
| [kara-kalpak_cyrillic](../scriptshifter/tables/data/kara.yml) | Kara-Kalpak (Cyrillic) | Y | Y | stable |
| [karachai-balkar_cyrillic](../scriptshifter/tables/data/karachai.yml) | Karachay-Balkar (Cyrillic) | Y | Y | stable |
| [karelian_cyrillic](../scriptshifter/tables/data/karelian_cyrillic.yml) | Karelian (Cyrillic) | Y | Y | stable |
| [katakana](../scriptshifter/tables/data/katakana.yml) | Japanese (Katakana) | Y | Y | |
| [kazakh_cyrillic](../scriptshifter/tables/data/kazakh_cyrillic.yml) | Kazakh (Cyrillic) | Y | Y | stable |
| [khakass_cyrillic](../scriptshifter/tables/data/khakass_cyrillic.yml) | Khakass (Cyrillic) | Y | Y | stable |
| [khanty_cyrillic](../scriptshifter/tables/data/khanty_cyrillic.yml) | Khanty (Cyrillic) | Y | Y | stable |
| [khmer](../scriptshifter/tables/data/khmer.yml) | Khmer | Y | Y | |
| [komi_cyrillic](../scriptshifter/tables/data/komi_cyrillic.yml) | Komi (Cyrillic) | Y | Y | stable |
| [korean_names](../scriptshifter/tables/data/korean_names.yml) | Korean (last + first names only) | N | Y | |
| [korean_nonames](../scriptshifter/tables/data/korean_nonames.yml) | Korean | N | Y | |
| [korean_old](../scriptshifter/tables/data/korean_old.yml) | Korean | x | x | |
| [koryak_cyrillic](../scriptshifter/tables/data/koryak_cyrillic.yml) | Koryak (Cyrillic) | Y | Y | stable |
| [kurdish](../scriptshifter/tables/data/kurdish.yml) | Kurdish | Y | N | stable |
| [kyrgyz_cyrillic](../scriptshifter/tables/data/kyrgyz_cyrillic.yml) | Kyrgyz (Cyrillic) | Y | Y | stable |
| [lithuanian_cyrillic](../scriptshifter/tables/data/lithuanian_cyrillic.yml) | Lithuanian (Cyrillic) | Y | Y | stable |
| [macedonian](../scriptshifter/tables/data/macedonian.yml) | Macedonian | Y | Y | stable |
| [malayalam](../scriptshifter/tables/data/malayalam.yml) | Malayalam | Y | Y | | s-to-r lacks capitalization
| [mansi_cyrillic](../scriptshifter/tables/data/mansi_cyrillic.yml) | Mansi (Cyrillic) | Y | Y | stable |
| [marathi](../scriptshifter/tables/data/marathi.yml) | Marathi | Y | Y | | s-to-r lacks capitalization
| [mari_cyrillic](../scriptshifter/tables/data/mari_cyrillic.yml) | Mari (Cyrillic) | Y | Y | stable |
| [moldovan_cyrillic](../scriptshifter/tables/data/moldovan_cyrillic.yml) | Moldovan (Cyrillic) | Y | Y | stable |
| [mongolian_cyrillic](../scriptshifter/tables/data/mongolian_cyrillic.yml) | Mongolian (Cyrillic) | Y | Y | stable |
| [mongolian_mongol_bichig](../scriptshifter/tables/data/mongolian_mongol_bichig.yml) | Mongolian (Mongol bichig) | Y | Y | stable |
| [mordvin_cyrillic](../scriptshifter/tables/data/mordvin_cyrillic.yml) | Mordvin (Cyrillic) | Y | Y | stable |
| [nenets_cyrillic](../scriptshifter/tables/data/nenets_cyrillic.yml) | Nenets (Cyrillic) | Y | Y | stable |
| [nepali_devanagari](../scriptshifter/tables/data/nepali_devanagari.yml) | Nepali (Devanagari) | Y | Y | |
| [newari_devanagari](../scriptshifter/tables/data/newari_devanagari.yml) | Newari (Devanagari) | Y | Y | |
| [oriya](../scriptshifter/tables/data/oriya.yml) | Oriya | Y | Y | | s-to-r lacks capitalization
| [ossetic_cyrillic](../scriptshifter/tables/data/ossetic_cyrillic.yml) | Ossetic (Cyrillic) | Y | Y | stable |
| [pali](../scriptshifter/tables/data/pali.yml) | Pali | Y | Y | |
| [panjabi](../scriptshifter/tables/data/panjabi.yml) | Punjabi (Gurmukhi) | Y | Y | | s-to-r lacks capitalization
| [persian](../scriptshifter/tables/data/persian.yml) | Persian | Y | N | stable |
| [prakrit_devanagari](../scriptshifter/tables/data/prakrit_devanagari.yml) | Prakrit (Devanagari) | Y | Y | |
| [pulaar](../scriptshifter/tables/data/pulaar.yml) | Pulaar (Adlam) | Y | Y | |
| [pushto](../scriptshifter/tables/data/pushto.yml) | Pushto | Y | N | stable |
| [rajasthani_devanagari](../scriptshifter/tables/data/rajasthani_devanagari.yml) | Rajasthani (Devanagari) | Y | Y | |
| [romani_cyrillic](../scriptshifter/tables/data/romani_cyrillic.yml) | Romani (Cyrillic) | Y | Y | stable |
| [russian](../scriptshifter/tables/data/russian.yml) | Russian | Y | Y | stable |
| [sanskrit_devanagari](../scriptshifter/tables/data/sanskrit_devanagari.yml) | Sanskrit (Devanagari) | Y | Y | | s-to-r lacks capitalization
| [serbian](../scriptshifter/tables/data/serbian.yml) | Serbian | Y | Y | stable |
| [shor_cyrillic](../scriptshifter/tables/data/shor_cyrillic.yml) | Shor (Cyrillic) | Y | Y | stable |
| [sinhalese_sinhala](../scriptshifter/tables/data/sinhalese_sinhala.yml) | Sinhalese (Sinhala) | Y | Y | | s-to-r lacks capitalization
| [syriac_cyrillic](../scriptshifter/tables/data/syriac_cyrillic.yml) | Syriac (Cyrillic) | Y | Y | stable |
| [tajik_cyrillic](../scriptshifter/tables/data/tajik_cyrillic.yml) | Tajik (Cyrillic) | Y | Y | stable |
| [tamil](../scriptshifter/tables/data/tamil.yml) | Tamil | Y | Y | beta |
| [tamil_brahmi](../scriptshifter/tables/data/tamil_brahmi.yml) | Tamil Brahmi | Y | Y | |
| [tamil_extended](../scriptshifter/tables/data/tamil_extended.yml) | Tamil (extended) | Y | Y | |
| [tatar-kryashen_cyrillic](../scriptshifter/tables/data/tatar.yml) | Tatar-Kryashen (Cyrillic) | Y | Y | stable |
| [tatar_cyrillic](../scriptshifter/tables/data/tatar_cyrillic.yml) | Tatar (Cyrillic) | Y | Y | stable |
| [telugu](../scriptshifter/tables/data/telugu.yml) | Telugu | Y | Y | | s-to-r lacks capitalization
| [thai](../scriptshifter/tables/data/thai.yml) | Thai | Y | Y | |
| [tibetan](../scriptshifter/tables/data/tibetan.yml) | Tibetan | Y | Y | |
| [turkmen_cyrillic](../scriptshifter/tables/data/turkmen_cyrillic.yml) | Turkmen (Cyrillic) | Y | Y | stable |
| [tuvinian_cyrillic](../scriptshifter/tables/data/tuvinian_cyrillic.yml) | Tuvinian (Cyrillic) | Y | Y | stable |
| [udmurt_cyrillic](../scriptshifter/tables/data/udmurt_cyrillic.yml) | Udmurt (Cyrillic) | Y | Y | stable |
| [uighur_cyrillic](../scriptshifter/tables/data/uighur_cyrillic.yml) | Uighur (Cyrillic) | Y | Y | stable |
| [ukrainian](../scriptshifter/tables/data/ukrainian.yml) | Ukrainian | Y | Y | stable |
| [urdu](../scriptshifter/tables/data/urdu.yml) | Urdu | Y | N | stable |
| [uzbek_cyrillic](../scriptshifter/tables/data/uzbek_cyrillic.yml) | Uzbek (Cyrillic) | Y | Y | stable |
| [yakut_cyrillic](../scriptshifter/tables/data/yakut_cyrillic.yml) | Yakut (Cyrillic) | Y | Y | stable |
| [yiddish](../scriptshifter/tables/data/yiddish.yml) | Yiddish | Y | Y | |
| [yuit_cyrillic](../scriptshifter/tables/data/yuit_cyrillic.yml) | Yuit (Cyrillic) | Y | Y | stable |
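
Because the table uses a fixed column order, it can be read programmatically, for example to list the mappings whose blank status marks them as backed by a third-party library. The following is a minimal sketch and not part of ScriptShifter: `parse_table` and the `COLUMNS` key names are hypothetical, and the sample rows are copied from the table above.

```python
# Hypothetical short keys for the six table columns, in table order.
COLUMNS = ["mapping", "script", "r2s", "s2r", "status", "remarks"]

SAMPLE = """\
| Mapping file | Script Name | Roman-to-script | Script-to-roman | Status | Remarks
| -------- | ------- | ------- | ------- | ------- | ------- |
| [arabic](../scriptshifter/tables/data/arabic.yml) | Arabic (S2R) | N | Y | stable |
| [gujarati](../scriptshifter/tables/data/gujarati.yml) | Gujarati | Y | Y | | s-to-r lacks capitalization
| [ethiopic](../scriptshifter/tables/data/ethiopic.yml) | Ethiopic (Amharic) | Y | Y | beta |
"""


def parse_table(markdown):
    """Return one dict per data row of the supported-scripts table."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Skip non-table lines and the |---|---| separator row.
        if not line.startswith("|") or set(line) <= set("|- "):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] == "Mapping file":  # header row
            continue
        # Rows without remarks have fewer cells; pad them with blanks.
        cells += [""] * (len(COLUMNS) - len(cells))
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


rows = parse_table(SAMPLE)
third_party = [row["script"] for row in rows if row["status"] == ""]
print(third_party)
```

Padding short rows keeps the sketch tolerant of rows that omit the Remarks cell, which most rows in the table do.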
70 changes: 54 additions & 16 deletions scriptshifter/hooks/greek/__init__.py
@@ -77,12 +77,17 @@ def parse_numeral(ctx):
     characters mixed with letter characters without a space. Therefore,
     "͵ακακαα" would transliterate "1021kaa", and "͵αακαα", "1001kaa".
     """
-    # Parse thousands.
+    # Parse ≥1000.
     if ctx.src[ctx.cur] == THOUSANDS_PREFIX:
         tk = ctx.src[ctx.cur + 1]

         try:
-            ctx.dest_ls.append(str(DIGITS[4][tk]))
+            # Exception for 2-letter digit.
+            if ctx.src[ctx.cur + 1: ctx.cur + 3] == "στ":
+                ctx.dest_ls.append(str(DIGITS[4]["στ"]))
+                ctx.cur += 1
+            else:
+                ctx.dest_ls.append(str(DIGITS[4][tk]))
             ctx.cur += 2

         except KeyError:
@@ -104,8 +109,13 @@ def parse_numeral(ctx):
                 ext[ext_cur] = str(DIGITS[3 - i][ctx.src[ctx.cur]])
                 ctx.cur += 1
             except KeyError:
-                # If the number char is not in the correct position, pad with 0
-                continue
+                # Exception for 2-letter digit.
+                if i == 2 and ctx.src[ctx.cur: ctx.cur + 2] == "στ":
+                    ext[ext_cur] = "6"
+                    ctx.cur += 2
+                else:
+                    # If the char is not in the correct position, pad with 0.
+                    continue
             finally:
                 ext_cur += 1
         ctx.dest_ls.extend(ext)
@@ -119,23 +129,51 @@ def parse_numeral(ctx):
     # transliterated characters.
     if ctx.src[ctx.cur] == NUM_SUFFIX:
         # Move back up to 3 positions.
-        for i in range(1, 4):
-            cur = ctx.cur - i
+        offset = 0 # Added offset if στ is found.
+        parsed = 0 # Parsed numeral to replace the alpha characters.
+        breakout = False # Break out of i loop.
+
+        i = 1 # Current position in the numeral. 1 = units, 2 = tens, etc.
+        mark_pos = ctx.cur # Mark this position to resume parsing later.
+        while i < 4:
+            if breakout:
+                break
+            cur = ctx.cur - i - offset
             if cur >= 0:
                 num_tk = ctx.src[cur] # Number to be parsed
-                if ctx.src[cur] in DIGITS[i]:
-                    # Not yet reached word boundary.
-                    ctx.dest_ls[-i] = str(DIGITS[i][num_tk])
-                else:
-                    if ctx.src[cur] != " ": # Word boundary.
-                        # Something's wrong.
+                # Exception for στ. Scan one character farther left.
+                if ctx.src[cur - 1:cur + 1] == "στ":
+                    num_tk = "στ"
+                    offset = 1
+                for j in range(i, 4):
+                    i = j
+                    if num_tk in DIGITS[j]:
+                        # Not yet reached word boundary.
+                        parsed += DIGITS[j][num_tk] * 10 ** (j - 1)
+                        break
+
+                    if num_tk == " " or cur == 0: # Word boundary.
+                        breakout = True
+                        break
+
+                    # If we got here we tried all positions without finding a
+                    # match. Something's wrong.
+                    if j == 3:
+                        # continue
                         ctx.warnings.append(
-                            f"Character `{ctx.src[cur] }` at position "
+                            f"Character `{num_tk}` at position "
                             f"{cur} is not a valid digit character "
                             f"at place #{4 - i} in a numeral.")
-
-        ctx.cur += 1
-        return CONT # Continue normal parsing.
+                        # ctx.cur += 1 + offset
+                        # return CONT # Continue normal parsing.
+            i += 1
+
+        if parsed > 0:
+            ctx.dest_ls = (
+                ctx.dest_ls[:mark_pos - len(str(parsed)) - offset]
+                + [str(parsed)])
+
+        ctx.cur = mark_pos + 1 # Skip past numeral suffix.

     ctx.cur += 1
     return CONT
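
The place-value logic in this file can be easier to follow outside the hook machinery. Below is a standalone sketch, assuming the standard Greek alphabetic numeral values. It does not use the hook's `ctx` object or its `DIGITS` table, whose layout is not shown in this diff, and `tokenize`, `greek_numeral_to_int`, and the three lookup tables are hypothetical names. Unlike the hook, it handles only a bare numeral, not numerals mixed with letters.

```python
# Standard alphabetic numeral values (assumed; not copied from the hook).
UNITS = {"α": 1, "β": 2, "γ": 3, "δ": 4, "ε": 5,
         "\u03db": 6, "στ": 6, "ζ": 7, "η": 8, "θ": 9}
TENS = {"ι": 10, "κ": 20, "λ": 30, "μ": 40, "ν": 50,
        "ξ": 60, "ο": 70, "π": 80, "\u03df": 90}
HUNDREDS = {"ρ": 100, "σ": 200, "τ": 300, "υ": 400, "φ": 500,
            "χ": 600, "ψ": 700, "ω": 800, "\u03e1": 900}

THOUSANDS_PREFIX = "\u0375"    # lower numeral sign, placed before the digit
NUM_SUFFIXES = "\u0374\u02b9"  # numeral sign and its NFC-normalized form


def tokenize(body):
    """Split a numeral into tokens, keeping the digraph στ (6) together."""
    tokens, i = [], 0
    while i < len(body):
        step = 2 if body[i:i + 2] == "στ" else 1
        tokens.append(body[i:i + step])
        i += step
    return tokens


def greek_numeral_to_int(numeral):
    """Convert a bare numeral, with or without its suffix, to an int."""
    tokens = tokenize(numeral.rstrip(NUM_SUFFIXES))
    total, pos = 0, 0
    # Optional thousands: prefix sign followed by a units digit.
    if tokens and tokens[0] == THOUSANDS_PREFIX:
        if len(tokens) < 2 or tokens[1] not in UNITS:
            raise ValueError(f"bad thousands digit in {numeral!r}")
        total += 1000 * UNITS[tokens[1]]
        pos = 2
    # Each lower place is optional but must appear in descending order.
    for table in (HUNDREDS, TENS, UNITS):
        if pos < len(tokens) and tokens[pos] in table:
            total += table[tokens[pos]]
            pos += 1
    if pos != len(tokens) or total == 0:
        raise ValueError(f"not a valid numeral: {numeral!r}")
    return total


print(greek_numeral_to_int("λβ\u0374"))  # 32
```

Reading στ as a single digit is unambiguous in a well-formed numeral: place values appear in descending order, so σ (200) is never followed by τ (300).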
2 changes: 2 additions & 0 deletions scriptshifter/tables/data/greek_classical.yml
@@ -344,6 +344,7 @@ script_to_roman:
 "\u037C": "(."
 "\u037D": ".)"
 "\u037E": "?\u0333"
+";": "?"
 "\u037F": "J"
 # \u0380 reserved
 # \u0381 reserved
@@ -594,6 +595,7 @@ script_to_roman:
 ".)\u0333": "\u03FF"
 ".)": "\u037D"
 "?\u0333": "\u037E"
+"?": "\u037E"
 "\"\u0332": "\u201C"
 "\"\u0333": "\u201D"
 "'\u0332": "\u2018"
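
One possible reason a plain `;` needs its own entry (the commit does not state its motivation, so this is an assumption): U+037E GREEK QUESTION MARK is canonically equivalent to U+003B SEMICOLON, so Greek text that has passed through Unicode normalization carries an ordinary semicolon. The sketch below uses only the standard-library `unicodedata` module; the constant names and the `nfc` helper are illustrative.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"
GREEK_NUMERAL_SIGN = "\u0374"


def nfc(text):
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


# Both characters have singleton canonical decompositions, so any
# normalization form replaces them with the target character.
print(nfc(GREEK_QUESTION_MARK) == ";")       # True
print(nfc(GREEK_NUMERAL_SIGN) == "\u02b9")   # True
```

The same equivalence applies to the numeral sign, which is worth keeping in mind when comparing input against a numeral suffix constant.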
5 changes: 4 additions & 1 deletion tests/data/script_samples/greek.csv
@@ -10,7 +10,7 @@ greek_classical,ἀΰπνους νύκτας ἴαυον,aypnous nyktas iauon,,
 greek_classical,Λητοῦς καὶ Διὸς υἱός,Lētous kai Dios huios,,
 greek_classical,ὑϊκὸν πάσχειν,hyikon paschein,,
 greek_classical,εἶπε πρὸς τὸν ἄνδρα τὸν ἑωυτῆς,eipe pros ton andra ton heōutēs,,
-greek_classical,τί τοῦδ’ ἂν εὕρημ’ ηὗρον εὐτυχέστερον;,ti toud’ an heurēm’ hēuron eutychesteron,,
+greek_classical,τί τοῦδ’ ἂν εὕρημ’ ηὗρον εὐτυχέστερον;,ti toud’ an heurēm’ hēuron eutychesteron?,,
 greek_classical,Τοῦ Κατὰ πασῶν αἱρέσεων ἐλέγχου βιβλίον αʹ,Tou Kata pasōn haireseōn elenchou biblion 1,,
 greek_classical,καλὸν κἀγαθόν,kalon kagathon,,
 greek_classical,ᾤχοντο θοἰμάτιον λαβόντες μου,ōchonto thoimation labontes mou,,
@@ -21,6 +21,9 @@ greek_classical,ἄλαϲτα δὲ ϝέργα πάθον κακὰ μηϲαμέ
 greek_classical,Δαμαρέτα τ’ ἐρατά τε Ϝιανθεμίϲ,Damareta t’ erata te Wianthemis,,
 greek_classical,ξένϝος,xenwos,,
 greek_classical,Πάτροϙλος,Patroḳlos,,
+greek_classical,"λβʹ. Ἐπεὶ δὲ ἡ τύχη κράτιστον ἐπὶ πάντα τὰ ἀνθρώπεια, μηδὲ Ἡλιόδωρος ἀπαξιούσθω σοφιστῶν κύκλου παράδοξον ἀγώνισμα τύχης γενόμενος·","32. Epei de ē tychi kratiston epi panta ta anthrōpeia, mide Hēliodōros apaxiousthō sophistōn kyklou paradoxon agōnisma tychis genomenos",,
+greek_classical,"κζʹ. Μὴ δεύτερα τῶν προειρημένων σοφιστῶν μηδὲ Ἱππόδρομόν τις ἡγείσθω τὸν Θετταλόν, τῶν μὲν γὰρ βελτίων φαίνεται, τῶν δὲ οὐκ οἶδα ὅ τι λείπεται","27. Mē deutera tōn proeirēmenōn sophistōn mide Ippodromon tis ēgeisthō ton Thettalon, tōn men gar beltiōn phainetai, tōn de ouk oida o ti leipetai",,
+greek_classical,"ιγʹ. Πῶλον δὲ τὸν Ἀκραγαντῖνον Γοργίας σοφιστὴν ἐξεμελέτησε πολλῶν, ὥς φασι, χρημάτων, καὶ γὰρ δὴ καὶ τῶν πλουτούντων ὁ Πῶλος.","13. Pōlon de ton Akragantinon Gorgias sophistēn exemeletēse pollōn, ōs phasi, chrēmatōn, kai gar dē kai tōn ploutountōn o Pōlos",,
 greek_modern,"Ἐτήσια ἔκθεσις / Κυπριακὴ Δημοκρατία, Ὑπουργεῖον Ἐργασίας καὶ Κοινωνικῶν Ἀσφαλίσεων","Etēsia ekthesis / Kypriakē Dēmokratia, Hypourgeion Ergasias kai Koinōnikōn Asphaliseōn",,
 greek_modern,"Ετήσια έκθεση / Κυπριακή Δημοκρατία, Υπουργείο Εργασίας και Κοινωνικών Ασφαλίσεων","Etēsia ekthesē / Kypriakē Dēmokratia, Hypourgeio Ergasias kai Koinōnikōn Asphaliseōn",,
 greek_modern,Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής Πολιτικής,Hellēniko Hidryma Eurōpaikēs kai Exōterikēs Politikēs,,
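
Several of the sample rows contain commas inside quoted fields, so they need a real CSV parser rather than a plain `split(",")`. Here is a minimal sketch of loading such rows with the standard-library `csv` module. The column meaning (mapping name, source text, expected romanization, then two unused columns) is inferred from the rows in this diff, and `Sample` and `load_samples` are hypothetical names, not the project's test harness.

```python
import csv
import io
from typing import NamedTuple


class Sample(NamedTuple):
    table: str     # mapping name, e.g. "greek_classical"
    source: str    # text in the original script
    expected: str  # expected romanization


# Two rows copied from greek.csv; the second has commas inside quotes.
SAMPLE_CSV = (
    'greek_classical,καλὸν κἀγαθόν,kalon kagathon,,\n'
    'greek_modern,"Ἐτήσια ἔκθεσις / Κυπριακὴ Δημοκρατία, Ὑπουργεῖον '
    'Ἐργασίας καὶ Κοινωνικῶν Ἀσφαλίσεων","Etēsia ekthesis / Kypriakē '
    'Dēmokratia, Hypourgeion Ergasias kai Koinōnikōn Asphaliseōn",,\n'
)


def load_samples(text):
    """Parse CSV text into Sample tuples, ignoring the trailing columns."""
    reader = csv.reader(io.StringIO(text))
    return [Sample(*row[:3]) for row in reader if row]


for sample in load_samples(SAMPLE_CSV):
    print(sample.table, "|", sample.source, "->", sample.expected)
```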