WIP: Final assignment #3

Open
wants to merge 47 commits into
base: sybille

Conversation

@Sib007 Sib007 commented May 2, 2023

No description provided.

Sib007 and others added 6 commits May 2, 2023 11:37
For reference, these are examples of translation memory (TM) and termbase (TB) formats compatible with most CAT-tools (computer-assisted translation tools).
README.md Outdated
# What I'd like to add but don't know how

- It would be neat if I could do something with the translation memory and termbase of each project. Maybe add a method that opens them for a preview?
- It would also be super useful to have a way to align a source and a target text and generate a file that can be added to a translation memory. So, to start from two docx-files (or txt-files), split them into sentences and pair each sentence in the source text with the corresponding sentence in the target text and generate a single xml-file with the paired sentences. Ideally, the context (i.e. the surrounding sentences) should also be considered, but if that's not possible an aligned xml would already be awesome.
Owner

This doesn't sound too complicated, at least if you can have both source and target texts as lists of sentences, where each sentence has a corresponding one in the other text at the same index. (Think of zip().)

With plain text files (e.g. txt) this will be much easier. With docx files, not so much. I found the following package that opens Word files with Python; you can try it out and see what you get: https://python-docx.readthedocs.io/en/latest/user/documents.html

Then I'd need to see what the aligned xml looks like to get an idea of what the output should be.

Author

Thanks! I just checked and csv is also a good format for TM import and it has a much "lighter" structure, so I'm going to work with csv rather than xml. :)

Owner

Both nltk and spacy offer sentence segmenters. Depending on the languages you need and which library you feel more comfortable with, you could use one or the other.

Example

Given a running text (one string, here called text), we try to get a list of its sentences (here called sentences).

nltk

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text, language='english')

"english" is actually the default value of language, but you can provide others. The only documentation I found on which languages are supported is this: https://stackoverflow.com/questions/15111183/what-languages-are-supported-for-nltk-word-tokenize-and-nltk-pos-tag

spacy

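import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
# doc.sents is a generator of Span objects; .text gives each sentence as a string
sentences = [sent.text for sent in doc.sents]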

Here the language is defined in the name of the model given to spacy.load(), and can have other values: https://spacy.io/usage/models#languages

Owner

In any case, you still have to know that given a pair of texts, the first sentence of one corresponds to the first sentence of the other, the second to the second, and so on...
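
As a rough sketch of that index-based pairing, assuming both texts are already split into sentence lists and that a two-column CSV is an acceptable TM import format: the function names and the "source"/"target" column headers below are made up for illustration, and the length check is only a crude guard, not a real alignment algorithm.

```python
import csv
import io

def align_sentences(source_sentences, target_sentences):
    """Pair sentences by position; refuse to guess if the counts differ."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            f"Cannot align: {len(source_sentences)} source vs "
            f"{len(target_sentences)} target sentences."
        )
    return list(zip(source_sentences, target_sentences))

def write_pairs(pairs, handle):
    """Write (source, target) pairs as CSV rows with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])  # hypothetical column names
    writer.writerows(pairs)

source = ["Hello.", "How are you?"]
target = ["Bonjour.", "Comment allez-vous ?"]
pairs = align_sentences(source, target)

# StringIO stands in for open("tm.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

If the two lists have different lengths, zip() alone would silently drop the extra sentences, which is why the sketch raises an error instead.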

README.md Outdated
```

If you run this script, you become proud of yourself.
# What I'm planning
Owner

Sounds very good!

README.md Outdated
- Revisor
- Status

-> The default value for "Translator" and "Revisor" is "Internal", meaning that an employee of the translation agency took up the job. If the agency assigned the job to a freelancer, the default value can be changed to their name.
Owner

Yeah, you could also consider having a "Contact" class that holds more information about a person and that could be retrieved from some database / json / csv... e.g. something with name, surname, languages, email, phone, availability... and this class can be used for all three attributes.
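
A minimal sketch of what such a class could look like, assuming the contacts live in a CSV file with a header row. The field selection, the ';' separator for languages and the load_contacts helper are all assumptions for illustration, not something taken from the PR.

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Contact:
    """A person who can be assigned to a project (hypothetical fields)."""
    name: str
    surname: str
    email: str = ""
    languages: list = field(default_factory=list)

    def __str__(self):
        return f"{self.name} {self.surname}"

def load_contacts(handle):
    """Read contacts from CSV; languages are assumed to be ';'-separated."""
    contacts = []
    for row in csv.DictReader(handle):
        languages = [lang for lang in row["languages"].split(";") if lang]
        contacts.append(Contact(row["name"], row["surname"], row["email"], languages))
    return contacts

# StringIO stands in for an open CSV file
sample = io.StringIO(
    "name,surname,email,languages\n"
    "Ada,Lovelace,ada@example.com,en;fr\n"
)
contacts = load_contacts(sample)
print(contacts[0], contacts[0].languages)
```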

@montesmariana
Owner

If you want me to review code, please submit it in a plain text file (e.g. a script in a .py file) instead of a Jupyter Notebook, so it's more readable without having to download it :)

Sib007 and others added 12 commits June 2, 2023 14:53
Removing files that are no longer necessary
Removing the files that are no longer necessary.
Removing the files that are no longer necessary.
Removing the files that are no longer necessary.
Removing the files that are no longer necessary.
Removing the files that are no longer necessary.

class Project:

def __init__(self, title, client, source, target, words, start, deadline, price, tm, translator = 'internal', revisor = 'internal', status = 'created', domain = ''):
Owner

For readability, I would recommend putting each argument on its own line.
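
Something along these lines, as a sketch only: the parameter names and defaults are copied from the signature above, but the body is a stand-in that stores just a few attributes, and the sample values in the call are invented.

```python
class Project:
    """Minimal stand-in for the class under review; only the signature matters here."""

    def __init__(
        self,
        title,
        client,
        source,
        target,
        words,
        start,
        deadline,
        price,
        tm,
        translator="internal",
        revisor="internal",
        status="created",
        domain="",
    ):
        # Only a few attributes are stored to keep the sketch short.
        self.title = title
        self.translator = translator
        self.status = status

# Invented sample values, just to show the call still works the same way
proj = Project("Manual", "ACME", "EN", "NL", 1200, "2023-06-01", "2023-06-30", 0.12, "acme.tmx")
```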

Author

Done, thanks for the tip! :)

parser.add_argument("--domain", type=str, default = "", help="Overall domain of the text")
args = parser.parse_args()
proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)
proj.days_left()
Owner

You will need to explicitly print() your variables for them to be visible when running the script.

Author

I tried to add it, but I'm not sure I've done it correctly. Could you have a look? Many thanks in advance!

Owner

Just adding print() like you did in line 229 should work (also for anything else you have to print).
You can test it by running the script! Sorry, it was never going to work with the wrong "__main" above...

script.py Outdated
self.words = words
if type(start) != str:
raise TypeError("The start date must be provided as a string.")
elif not re.match ("[0-9]{4}-[0-9]{2}-[0-9]{2}", start):
Owner

For validation of your ISO-format dates: the regex is not very robust (0000-00-00 would match, and so would 0123-40-54...), so I would try the following instead:

try:
    self.start = datetime.date.fromisoformat(start)
except (TypeError, ValueError):
    # TypeError: start is not a string; ValueError: not a valid ISO date
    raise TypeError("The start date must be provided in ISO format")

This has the advantage of also already converting your date :)
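
To make the difference concrete, here is a small sketch comparing the two checks on one of the bad inputs mentioned above. The parse_iso_date helper is a name invented for this example; it is not part of the script.

```python
import datetime
import re

def parse_iso_date(value):
    """Return a datetime.date, or raise ValueError if value is not a real ISO date."""
    try:
        return datetime.date.fromisoformat(value)
    except (TypeError, ValueError) as err:
        raise ValueError(f"Not a valid ISO date: {value!r}") from err

# The pattern from the script only checks the shape of the string
pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
regex_accepts = re.match(pattern, "0123-40-54") is not None

# fromisoformat also checks that month and day exist
try:
    parse_iso_date("0123-40-54")
    parser_accepts = True
except ValueError:
    parser_accepts = False

print(regex_accepts, parser_accepts)
print(parse_iso_date("2023-06-11"))
```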

Owner

If you're worried about not using regex anywhere, you can always add it as an email validator in your Freelancer class ;)

Author

@Sib007 Sib007 Jun 11, 2023

I changed it, but the problem is that then, when I call the attribute "deadline", I once again have the problem that it's fairly "cumbersome"-looking instead of simply displaying the ISO-format date (the string). So, it's a more efficient code, but it's not very user-friendly in attribute display. I therefore used the "try" with the computed attributes st and dl and still kept the string as the value for start and deadline. Could you have a look and say if it's alright that way? Many thanks in advance!

Owner

This is similar to what I've mentioned before of having private and public attributes. You could have _start and _deadline (or st and dl) as the private versions used in computing other things, and just start and deadline for printing. Or you could have a method that returns the isoformat version of a date.

In any case, according to the docs, if you have a value of class date, you can turn it back into ISO format with .isoformat() or just print it. That is, even if start is a date, my_project.start.isoformat() returns the ISO-format string, and print(my_project.start) (instead of just evaluating my_project.start) displays that same string.
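
One possible way to combine both ideas, sketched with a stripped-down class that only models the start date. The property and the days_since_start helper are illustrative assumptions, not the code in the PR.

```python
import datetime

class Project:
    """Stripped-down stand-in: only the start date is modelled."""

    def __init__(self, start):
        # Private date object, used for computations
        self._start = datetime.date.fromisoformat(start)

    @property
    def start(self):
        """Public, read-only ISO string for display."""
        return self._start.isoformat()

    def days_since_start(self, today):
        """Hypothetical helper showing date arithmetic on the private value."""
        return (today - self._start).days

my_project = Project("2023-06-01")
print(my_project.start)
print(my_project.days_since_start(datetime.date(2023, 6, 13)))
```

This way my_project.start always shows the plain string, while the date object stays available internally.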

Owner

@montesmariana montesmariana Jun 13, 2023

I know I said that for checking your code it's better if you show me a script instead of a notebook (in terms of commenting the PR), but the difference still stands: definitions in the scripts, actual code running in the notebooks. In particular, if you want me to check how you defined the classes, THAT has to be in the script, but not the test data and examples.
(And here the examples don't work because you are redefining the variables, so the script itself crashes.)

epilog = "You get an overview of the agency's projects.")
parser.add_argument("filename",
type=argparse.FileType("r", encoding="utf-8"),
help="The list of translation projects")
args = parser.parse_args()
proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)
Owner

You are still assuming other arguments provided via the command line. Once you have replaced them with only the filename, you have to work with the file (args.filename); args.title, args.client, etc. don't exist anymore.
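
A sketch of working with args.filename directly, under the assumption that the file contains a JSON list with one object per project (the actual file format is not specified in this thread). build_parser and load_projects are names invented here, and the temporary file only exists so the example can run without real data.

```python
import argparse
import json
import tempfile

def build_parser():
    parser = argparse.ArgumentParser(description="Overview of translation projects")
    parser.add_argument(
        "filename",
        type=argparse.FileType("r", encoding="utf-8"),
        help="The list of translation projects",
    )
    return parser

def load_projects(handle):
    """Assumes the file holds a JSON list of objects, one per project."""
    return json.load(handle)

# Write a small sample file so the sketch can be run without real data.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    json.dump([{"title": "Manual", "client": "ACME"}], tmp)

# Passing a list simulates the command line: python script.py projects.json
args = build_parser().parse_args([tmp.name])
with args.filename as handle:  # args.filename is already an open file object
    projects = load_projects(handle)

for project in projects:
    print(project["title"], project["client"])
```

Each dictionary could then be passed to Project(...) instead of the old args.title, args.client and so on.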

@@ -201,23 +215,16 @@ def __str__(self):
return "\n".join([sent_1, sent_2, sent_3, sent_4, sent_5, sent_6, sent_7])

if __name__ == "__main":
Owner

Sorry, I made a mistake in the template, it should be if __name__ == "__main__":

Owner

Otherwise nothing under it will ever run.
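
For reference, a minimal sketch of the corrected guard. The main() function is a hypothetical entry point; in the real script it would hold the argument parsing and the print() calls.

```python
def main():
    """Hypothetical entry point; a real script would parse arguments here."""
    message = "Running as a script."
    print(message)
    return message

if __name__ == "__main__":
    # True only when the file is executed directly (python script.py),
    # not when it is imported. The string "__main" never equals __name__,
    # which is why the block was skipped before.
    main()
```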

2 participants