WIP: Final assignment #3
base: sybille
For reference, these are examples of translation memory (TM) and termbase (TB) formats compatible with most CAT-tools (computer-assisted translation tools).
README.md
Outdated
# What I'd like to add but don't know how

- It would be neat if I could do something with the translation memory and termbase of each project. Maybe add a method that opens them for a preview?
- It would also be super useful to have a way to align a source and a target text and generate a file that can be added to a translation memory. So, to start from two docx-files (or txt-files), split them into sentences and pair each sentence in the source text with the corresponding sentence in the target text and generate a single xml-file with the paired sentences. Ideally, the context (i.e. the surrounding sentences) should also be considered, but if that's not possible an aligned xml would already be awesome.
This doesn't sound too complicated, at least if you can have both source and target texts as lists of sentences and each sentence has a corresponding one in the other text with the same index. (Think of `zip()`.)
With plain text files (e.g. txt) this will be much easier. With docx files, not so much. I found the following package that opens Word files with Python, you can try it out and see what you get: https://python-docx.readthedocs.io/en/latest/user/documents.html
Then I'd need to see what the aligned xml looks like to get an idea of what the output should be.
Thanks! I just checked and csv is also a good format for TM import and it has a much "lighter" structure, so I'm going to work with csv rather than xml. :)
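A minimal sketch of the `zip()` idea combined with CSV output, assuming both texts are already split into sentence lists of equal length. The function name `write_aligned_csv` and the header names `source` and `target` are made up for illustration; the column layout a given CAT tool expects on import may differ.

```python
import csv
import io


def write_aligned_csv(source_sentences, target_sentences, handle):
    """Write one source/target sentence pair per CSV row."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])  # hypothetical header names
    # zip() pairs the items that share the same index in both lists.
    for source, target in zip(source_sentences, target_sentences):
        writer.writerow([source, target])


source = ["Hello.", "How are you?"]
target = ["Bonjour.", "Comment allez-vous ?"]

# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_aligned_csv(source, target, buffer)
print(buffer.getvalue())
```

The `csv` module takes care of quoting sentences that contain commas or quotation marks, which is the main reason to prefer it over joining strings by hand.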
Both nltk and spacy offer sentence segmenters. Depending on the languages you need and what you feel more comfortable with, you could use one or the other.

Example

Given a running text `text` (as one string), we try to get a list of sentences `sentences`.

nltk

    from nltk.tokenize import sent_tokenize
    sentences = sent_tokenize(text, language='english')

"english" is actually the default value of `language`, but you can provide others. The only documentation I found about which languages are supported was this: https://stackoverflow.com/questions/15111183/what-languages-are-supported-for-nltk-word-tokenize-and-nltk-pos-tag

spacy

    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # doc.sents is a generator of Span objects; .text gives each sentence as a string
    sentences = [sent.text for sent in doc.sents]

Here the language is defined in the name of the model given to `spacy.load()`, and can have other values: https://spacy.io/usage/models#languages
In any case, you still have to know that given a pair of texts, the first sentence of one corresponds to the first sentence of the other, the second to the second, and so on...
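Since `zip()` silently stops at the shorter list, a length check before pairing can catch texts that were segmented differently. This is only a sketch; `pair_sentences` is a hypothetical helper, and equal length does not prove that the sentences really correspond.

```python
def pair_sentences(source_sentences, target_sentences):
    """Pair sentences by index, refusing lists of different lengths."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            f"Cannot align: {len(source_sentences)} source sentences "
            f"but {len(target_sentences)} target sentences."
        )
    return list(zip(source_sentences, target_sentences))


pairs = pair_sentences(["One.", "Two."], ["Un.", "Deux."])
print(pairs)
```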
README.md
Outdated
```

If you run this script, you become proud of yourself.
# What I'm planning
Sounds very good!
README.md
Outdated
- Revisor
- Status

-> The default value for "Translator" and "Revisor" is "Internal", meaning that an employee of the translation agency took up the job. If the agency assigned the job to a freelancer, the default value can be changed to their name.
Yeah, you could also consider having a "Contact" class that has more information about a person and that could be retrieved from some database / json / csv... e.g. something with name, surname, languages, email, phone, availabilities... and this class can be used for all three attributes.
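One possible shape for such a class, as a sketch only: the field names, the CSV columns and the `;` separator for languages are all assumptions, not something taken from the project.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Contact:
    """A person who can act as translator, revisor or client contact."""
    name: str
    surname: str
    email: str = ""
    phone: str = ""
    languages: list = field(default_factory=list)


def load_contacts(handle):
    """Build Contact objects from a CSV file with a header row (assumed layout)."""
    contacts = []
    for row in csv.DictReader(handle):
        contacts.append(Contact(
            name=row["name"],
            surname=row["surname"],
            email=row["email"],
            phone=row["phone"],
            # Languages are assumed to be stored as "en;fr" in one column.
            languages=row["languages"].split(";") if row["languages"] else [],
        ))
    return contacts


sample = io.StringIO(
    "name,surname,email,phone,languages\n"
    "Ada,Example,ada@example.com,,en;fr\n"
)
contacts = load_contacts(sample)
print(contacts[0])
```

A project could then store a `Contact` instance instead of a plain name string for each of the three attributes.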
Backup for presentation on 23-05.
If you want me to review code, please submit it in a plain text file (e.g. a script in a .py file) instead of a Jupyter Notebook, so it's more readable without having to download it :)
Removing files that are no longer necessary
Removing the files that are no longer necessary.
script-with-argparse.py
Outdated
class Project:

    def __init__(self, title, client, source, target, words, start, deadline, price, tm, translator = 'internal', revisor = 'internal', status = 'created', domain = ''):
For readability, I would recommend putting each argument on its own line.
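For illustration, the same signature with one argument per line might look like this. The attribute assignments in the body are assumed; the actual class presumably does more (validation, computed attributes).

```python
class Project:

    def __init__(
        self,
        title,
        client,
        source,
        target,
        words,
        start,
        deadline,
        price,
        tm,
        translator="internal",
        revisor="internal",
        status="created",
        domain="",
    ):
        # Assumed body: store every argument as an attribute of the same name.
        self.title = title
        self.client = client
        self.source = source
        self.target = target
        self.words = words
        self.start = start
        self.deadline = deadline
        self.price = price
        self.tm = tm
        self.translator = translator
        self.revisor = revisor
        self.status = status
        self.domain = domain
```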
Done, thanks for the tip! :)
script-with-argparse.py
Outdated
parser.add_argument("--domain", type=str, default = "", help="Overall domain of the text")
args = parser.parse_args()
proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)
proj.days_left()
You will need to explicitly `print()` your variables for them to be visible when running the script.
I tried to add it, but I'm not sure I've done it correctly. Could you have a look? Many thanks in advance!
Just adding `print()` like you did in line 229 should work (also for anything else you have to print).
You can test it by running the script! Sorry, it was never going to work with the wrong "__main" above...
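To make the difference concrete: in a notebook the value of the last expression in a cell is displayed automatically, while a script shows nothing unless it is printed. The function below is a hypothetical stand-in that assumes `days_left()` returns a number; the real method may behave differently.

```python
import datetime


def days_left(deadline, today=None):
    """Hypothetical stand-in: number of days from today until the deadline."""
    today = today or datetime.date.today()
    return (datetime.date.fromisoformat(deadline) - today).days


# In a script this line computes the value and then discards it silently.
days_left("2023-05-23", today=datetime.date(2023, 5, 18))

# Wrapping the call in print() makes the result visible on the command line.
print(days_left("2023-05-23", today=datetime.date(2023, 5, 18)))
```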
script.py
Outdated
self.words = words
if type(start) != str:
    raise TypeError("The start date must be provided as a string.")
elif not re.match ("[0-9]{4}-[0-9]{2}-[0-9]{2}", start):
For validation of your ISO-format dates: the regex is not very robust (0000-00-00 would match, and so would 0123-40-54), so I would try the following instead:

    try:
        self.start = datetime.date.fromisoformat(start)
    except ValueError:  # raised when the string is not a valid ISO date
        raise ValueError("The start date must be provided in ISO format")

This has the advantage of also already converting your date :)
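The same idea as a standalone helper, so that it can be reused for both the start date and the deadline. This is a sketch: `parse_iso_date` and its `label` parameter are made-up names, and catching only `ValueError` (rather than everything) is a design choice so that unrelated errors are not hidden.

```python
import datetime


def parse_iso_date(value, label="start"):
    """Return a datetime.date for an ISO-format string, or raise a clear error."""
    if not isinstance(value, str):
        raise TypeError(f"The {label} date must be provided as a string.")
    try:
        return datetime.date.fromisoformat(value)
    except ValueError:
        # fromisoformat() rejects impossible dates such as month 40 or year 0.
        raise ValueError(
            f"The {label} date must be a valid ISO-format date (YYYY-MM-DD)."
        ) from None


print(parse_iso_date("2023-05-23"))
```

Both problem cases mentioned above, `0000-00-00` and `0123-40-54`, are rejected by this helper even though they match the regex.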
If you're worried about using regex, you can always add it as an email validator for your Freelancer class ;)
I changed it, but then, when I call the attribute "deadline", it once again looks fairly "cumbersome" instead of simply displaying the ISO-format date (the string). So the code is more efficient, but the attribute display is not very user-friendly. I therefore used the "try" with the computed attributes `st` and `dl` and still kept the string as the value for `start` and `deadline`. Could you have a look and say if it's alright that way? Many thanks in advance!
This is similar to what I've mentioned before about having private and public attributes. You could have `_start` and `_deadline` (or `st` and `dl`) as the private versions used in computing other things, and just `start` and `deadline` for printing. Or you could have a method that returns the ISO-format version of a date.

In any case, according to the docs, if you have a value of class `date`, you can turn it back into ISO format with `.isoformat()` or just print it. That is, even if `start` is a `date`, `my_project.start.isoformat()` will return the ISO string, and `print(my_project.start)` (instead of just `my_project.start`) will also print it nicely.
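A stripped-down sketch of the private/public split using properties. Only the two date arguments are kept here, and `length_in_days` is a hypothetical method added to show the private `date` values being used in a computation.

```python
import datetime


class Project:
    """Reduced example: only the two date attributes are modelled."""

    def __init__(self, start, deadline):
        # Private attributes hold real date objects for calculations.
        self._start = datetime.date.fromisoformat(start)
        self._deadline = datetime.date.fromisoformat(deadline)

    @property
    def start(self):
        # Public attribute shows the readable ISO string.
        return self._start.isoformat()

    @property
    def deadline(self):
        return self._deadline.isoformat()

    def length_in_days(self):
        """Hypothetical helper: days between start and deadline."""
        return (self._deadline - self._start).days


my_project = Project("2023-05-01", "2023-05-23")
print(my_project.start)
print(my_project.length_in_days())
```

With this layout `my_project.start` displays as a plain string even without `print()`, while the calculations still work on dates.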
I know I said that for checking your code it's better if you show me a script instead of a notebook (in terms of commenting the PR), but the difference still stands: definitions in the scripts, actual code running in the notebooks. In particular, if you want me to check how you defined the classes, THAT has to be in the script, but not the test data and examples.
(And here the examples don't work because you are redefining the variables, thus the script itself crashes)
script-with-argparse.py
Outdated
    epilog = "You get an overview of the agency's projects.")
parser.add_argument("filename",
                    type=argparse.FileType("r", encoding="utf-8"),
                    help="The list of translation projects")
args = parser.parse_args()
proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)
You are still assuming that the other arguments are provided via the command line. Once you have replaced them with only the filename, you have to work with the file (`args.filename`); `args.title`, `args.client` etc. don't exist anymore.
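A sketch of how the file could be used once `filename` is the only argument. It assumes the project list is a CSV file with a header row containing at least `title` and `client` columns; the real file format is not shown in this thread, and `build_parser`, `load_projects` and `main` are illustrative names.

```python
import argparse
import csv
import io


def build_parser():
    parser = argparse.ArgumentParser(
        epilog="You get an overview of the agency's projects.",
    )
    parser.add_argument(
        "filename",
        type=argparse.FileType("r", encoding="utf-8"),
        help="The list of translation projects",
    )
    return parser


def load_projects(handle):
    """Return one dict per project; assumes CSV with a header row."""
    with handle:
        return list(csv.DictReader(handle))


def main(argv=None):
    args = build_parser().parse_args(argv)
    # args.filename is already an open file object, not a path string.
    for row in load_projects(args.filename):
        print(row["title"], row["client"])


# Demonstration with an in-memory file instead of a command-line argument.
rows = load_projects(io.StringIO("title,client\nManual,ACME\n"))
print(rows)
```

Each row could then be turned into a `Project` by passing its values to the constructor, instead of reading them from `args`.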
script-with-argparse.py
Outdated
@@ -201,23 +215,16 @@ def __str__(self):
        return "\n".join([sent_1, sent_2, sent_3, sent_4, sent_5, sent_6, sent_7])

if __name__ == "__main":
Sorry, I made a mistake in the template, it should be if __name__ == "__main__":
Otherwise nothing under it will ever run.
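For reference, a minimal script skeleton with the corrected guard. The `main()` function and its return text are placeholders, not the project's actual code.

```python
def main():
    """Placeholder for the real work of the script."""
    return "Project overview would be shown here."


# __name__ equals "__main__" only when the file is run directly,
# e.g. with `python script.py`, and not when it is imported.
if __name__ == "__main__":
    print(main())
```

With the misspelled string `"__main"` the comparison is always false, so the indented block is skipped without any error message.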
These are the final files for the last assignment of Introduction to Python.