Sai: Lowercase converter in Python #190

Open
wants to merge 6 commits into
base: master
50 changes: 50 additions & 0 deletions S23/b-sai/main.py
@@ -0,0 +1,50 @@

def to_lowercase(word: str, language: str):
    """
    word: str, the string to be converted to lowercase
    language: str, the language of the string, in BCP 47 format
    """
    result = ""

    if language.startswith(("zh", "th", "ja")):
Owner

BCP 47 permits 3-letter language codes, so "startswith" won't work here; e.g. "jam" (Jamaican Creole English) also starts with "ja".
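One way to avoid the prefix collision is to compare the primary language subtag exactly instead of testing the start of the whole tag. A minimal sketch, assuming well-formed tags with hyphen-separated subtags; `primary_subtag` and `UNICAMERAL` are hypothetical names, not part of the PR:

```python
def primary_subtag(language: str) -> str:
    """Return the primary language subtag of a BCP 47 tag, lowercased."""
    # Subtags are hyphen-separated; the first one identifies the language.
    return language.split("-")[0].lower()


# Hypothetical set holding the languages named in the PR's first branch.
UNICAMERAL = {"zh", "th", "ja"}

print(primary_subtag("zh-Hans") in UNICAMERAL)  # True: exact match on "zh"
print(primary_subtag("jam") in UNICAMERAL)      # False: "jam" is not "ja"
print("jam".startswith("ja"))                   # True: the collision described above
```

An exact comparison on the first subtag keeps region and script suffixes such as "zh-Hans" working while rejecting unrelated 3-letter codes.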

        return word.lower()
Owner

And the point in these cases was to skip the call to lower() entirely, as an optimization.
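A sketch of that early exit, assuming (as the PR's first branch does) that words tagged as Chinese, Thai, or Japanese need no case conversion. `CASELESS_LANGUAGES` and `to_lowercase_sketch` are hypothetical names, and the language-specific rules from the PR are left out:

```python
# Hypothetical set of languages whose words are returned untouched.
CASELESS_LANGUAGES = {"zh", "th", "ja"}


def to_lowercase_sketch(word: str, language: str) -> str:
    """Lowercase `word`, skipping the work for languages assumed caseless."""
    # Compare the primary subtag exactly so that "jam" does not match "ja".
    if language.split("-")[0].lower() in CASELESS_LANGUAGES:
        return word  # early exit: lower() is never called
    return word.lower()


print(to_lowercase_sketch("官话", "zh-Hans"))
print(to_lowercase_sketch("tACKY", "en"))
```

Returning the input unchanged means these words never go through a character-by-character loop at all.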


    for idx, letter in enumerate(word):

        lower_letter = letter.lower()
        if language == 'tr' or language == 'az':
            if letter == 'I':
                lower_letter = 'ı'
        elif language.startswith(('gd', 'gv', 'ga')):
Owner

Same issue as above; this won't work.

            is_2nd_letter = idx == 1
            is_exception_letter = letter in [
                'A', 'E', 'I', 'O', 'U', 'Á', 'É', 'Í', 'Ó', 'Ú', "Ó"]
            is_letter_o_latin = ord(letter) in [211]
Owner

Magic numbers, here and 771 below! Unreadable.
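One readable alternative is to name the characters with Python's `\N{...}` escapes instead of comparing raw code points. A minimal sketch; the constant names `O_ACUTE` and `COMBINING_TILDE` are hypothetical, chosen here for illustration:

```python
import unicodedata

# Named constants replace the bare integers 211 and 771.
O_ACUTE = "\N{LATIN CAPITAL LETTER O WITH ACUTE}"  # U+00D3
COMBINING_TILDE = "\N{COMBINING TILDE}"            # U+0303

# The constants carry the same code points as the magic numbers.
print(ord(O_ACUTE), ord(COMBINING_TILDE))  # 211 771

# unicodedata.name() recovers the official name of any code point.
print(unicodedata.name(chr(771)))  # COMBINING TILDE
```

With constants like these, a check such as `letter == O_ACUTE` states which character is meant without a lookup table.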

            is_beginning_exception = word[0] in ['n', 't']
            is_not_last = len(word)-idx > 1
            if is_2nd_letter and (is_exception_letter or is_letter_o_latin) and is_beginning_exception and (is_not_last and ord(word[idx+1]) != 771):
Owner

The Unicode business needs a bit more work; will discuss in class.
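The comment does not say what needs work, so the following covers only one aspect that may be relevant: an accented letter can be stored either as a single precomposed code point or as a base letter followed by a combining mark, and the two forms do not compare equal until normalized. A small sketch using the standard `unicodedata` module; the variable names are illustrative:

```python
import unicodedata

precomposed = "\N{LATIN CAPITAL LETTER O WITH ACUTE}"  # one code point
decomposed = "O\N{COMBINING ACUTE ACCENT}"             # base letter + mark

# The two spellings look identical but differ as strings.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Normalizing the input once, before iterating over characters, would let a per-letter comparison see one form only; whether that is the direction intended for this PR is an assumption.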

                lower_letter = "-"+letter.lower()
        elif language.startswith('el'):
            if letter == 'Σ' and idx == len(word)-1:
                lower_letter = 'ς'

        result += lower_letter

    return result


with open("tests.tsv", "r", encoding="utf-8") as f:
    tests = f.read().splitlines()

num_correct = 0
for test in tests:
    word, language, actual = test.split("\t")
    predicted = to_lowercase(word, language)
    if predicted != actual:
        print(f"COuldn't convert {word} in {language}!")
Owner

Small typo.

        print(f"Actual: {actual}")
        print(f"Predicted: {predicted}")
    else:
        num_correct += 1

print(f"Successfully completed {num_correct}/{len(tests)} tests")
5 changes: 5 additions & 0 deletions S23/b-sai/readme.md
@@ -0,0 +1,5 @@
### b-sai lowercase converter

This is a simple tool that converts uppercase letters to lowercase letters in a text file, in any language.

To run the Python script, run ```python3 main.py``` from the S23/b-sai/ directory.
7 changes: 7 additions & 0 deletions S23/b-sai/tests.tsv
@@ -16,3 +16,10 @@ KASIM en kasim
ΠΌΛΗΣ el πόλης
官话 zh-Hans 官话
ภาษาไทย th ภาษาไทย
車 ja 車
うさぎ ja うさぎ
ลา th ลา
ลิง th ลิง
ΚΑΘΙΣΤΕ el καθιστε
comPuTer_**#science en computer_**#science
tACKY en tacky