WIP: Final assignment #3

Sib007 · 2023-05-02T10:27:42Z

No description provided.

For reference, these are examples of translation memory (TM) and termbase (TB) formats compatible with most CAT-tools (computer-assisted translation tools).

montesmariana · 2023-05-09T07:03:01Z

README.md

+# What I'd like to add but don't know how
+
+- It would be neat if I could do something with the translation memory and termbase of each project. Maybe add a method that opens them for a preview?
+- It would also be super useful to have a way to align a source and a target text and generate a file that can be added to a translation memory. So, to start from two docx-files (or txt-files), split them into sentences and pair each sentence in the source text with the corresponding sentence in the target text and generate a single xml-file with the paired sentences. Ideally, the context (i.e. the surrounding sentences) should also be considered, but if that's not possible an aligned xml would already be awesome.


This doesn't sound too complicated, at least if you can have both source and target texts as lists of sentences and each sentence has a corresponding one on the other text with the same index. (Think of zip())

With plain text files (e.g. txt) this will be much easier. With docx files, not so much. I found the following package that opens Word files with Python, you can try it out and see what you get: https://python-docx.readthedocs.io/en/latest/user/documents.html

Then I'd need to see what the aligned xml looks like to get an idea of what the output should be.

Thanks! I just checked and csv is also a good format for TM import and it has a much "lighter" structure, so I'm going to work with csv rather than xml. :)

Both nltk and spacy offer sentence segmenters. Depending on the languages you need and what you feel more comfortable with, you could use one or the other.

Example

Given a running text `text` (as one string), we try to get a list of sentences `sentences`.

nltk

    from nltk.tokenize import sent_tokenize
    sentences = sent_tokenize(text, language='english')

"english" is actually the default value of language, but you can provide others. The only documentation I found about which languages are supported was this: https://stackoverflow.com/questions/15111183/what-languages-are-supported-for-nltk-word-tokenize-and-nltk-pos-tag

spacy

    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    sentences = list(doc.sents)  # doc.sents is a generator, so wrap it in list()

Here the language is defined in the name of the model given to spacy.load(), and can have other values: https://spacy.io/usage/models#languages

In any case, you still have to know that given a pair of texts, the first sentence of one corresponds to the first sentence of the other, the second to the second, and so on...
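To make the pairing idea from this thread concrete, here is a minimal sketch that assumes both texts are already split into sentence lists of equal length. The function names `align_sentences` and `write_tm_csv`, the language-code header row and the sample sentences are all hypothetical; the CSV layout a given CAT tool expects may differ.

```python
import csv
import io


def align_sentences(source_sentences, target_sentences):
    """Pair sentences by position; refuse to guess when the counts differ."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            f"Cannot align {len(source_sentences)} source sentences "
            f"with {len(target_sentences)} target sentences."
        )
    return list(zip(source_sentences, target_sentences))


def write_tm_csv(pairs, handle, source_lang="en", target_lang="nl"):
    """Write one row per sentence pair, with the language codes as header (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow([source_lang, target_lang])
    writer.writerows(pairs)


# Made-up sample data; in practice these lists would come from a sentence segmenter.
source = ["Hello.", "How are you?"]
target = ["Hallo.", "Hoe gaat het?"]

pairs = align_sentences(source, target)
buffer = io.StringIO()  # stands in for a file opened with newline=""
write_tm_csv(pairs, buffer)
print(buffer.getvalue())
```

Raising an error on unequal lengths is a deliberate choice here: zip() on its own would silently drop the extra sentences, which would hide a misalignment.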

montesmariana · 2023-05-09T07:04:46Z

README.md

-```
-
-If you run this script, you become proud of yourself.
+# What I'm planning


Sounds very good!

montesmariana · 2023-05-09T07:06:18Z

README.md

+    - Revisor
+    - Status
+
+   -> The default value for "Translator" and "Revisor" is "Internal", meaning that an employee of the translation agency took up the job. If the agency assigned the job to a freelancer, the default value can be changed to their name.


Yeah, you could also consider having a "Contact" class that has more information about a person and that could be retrieved from some database / json / csv... e.g. something with name, surname, languages, email, phone, availabilities... and this class can be used for all three attributes.
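As a rough illustration of that suggestion, a "Contact" class could look something like the sketch below. The field names, the semicolon-separated languages column and the `load_contacts` helper are assumptions made for this example, not something defined in the PR.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Contact:
    """Basic information about a person (hypothetical fields)."""
    name: str
    surname: str
    email: str = ""
    phone: str = ""
    languages: list = field(default_factory=list)

    @property
    def full_name(self):
        return f"{self.name} {self.surname}"


def load_contacts(handle):
    """Read contacts from CSV with columns name, surname, email, phone, languages."""
    contacts = {}
    for row in csv.DictReader(handle):
        contact = Contact(
            name=row["name"],
            surname=row["surname"],
            email=row["email"],
            phone=row["phone"],
            # Languages are assumed to be separated by semicolons, e.g. "en;nl".
            languages=[lang for lang in row["languages"].split(";") if lang],
        )
        contacts[contact.full_name] = contact
    return contacts


# Made-up sample data standing in for a real contacts file.
sample = io.StringIO(
    "name,surname,email,phone,languages\n"
    "Ada,Peeters,ada@example.com,,en;nl\n"
)
contacts = load_contacts(sample)
translator = contacts["Ada Peeters"]
print(translator.languages)
```

The same Contact object could then be passed wherever a translator, revisor or client is expected, instead of a bare name string.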

Backup for presentation on 23-05.

montesmariana · 2023-05-23T16:16:04Z

If you want me to review code, please submit it in a plain text file (e.g. a script in a .py file) instead of a Jupyter Notebook, so it's more readable without having to download it :)

Removing files that are no longer necessary

Removing the files that are no longer necessary.

montesmariana · 2023-06-06T07:01:14Z

script-with-argparse.py

+
+class Project:
+
+    def __init__(self, title, client, source, target, words, start, deadline, price, tm, translator = 'internal', revisor = 'internal', status = 'created', domain = ''):


For readability, I would recommend putting each argument on its own line.

Done, thanks for the tip! :)
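For reference, the signature from the diff with one argument per line could look like this. The parameter names and defaults come from the diff; the method body and the sample values are only a placeholder assumption.

```python
class Project:

    def __init__(
        self,
        title,
        client,
        source,
        target,
        words,
        start,
        deadline,
        price,
        tm,
        translator="internal",
        revisor="internal",
        status="created",
        domain="",
    ):
        # Placeholder body: simply store every argument as an attribute.
        self.title = title
        self.client = client
        self.source = source
        self.target = target
        self.words = words
        self.start = start
        self.deadline = deadline
        self.price = price
        self.tm = tm
        self.translator = translator
        self.revisor = revisor
        self.status = status
        self.domain = domain


# Made-up values, with keyword arguments to keep the call readable too.
proj = Project(
    title="User manual",
    client="ACME",
    source="en",
    target="nl",
    words=1200,
    start="2023-06-01",
    deadline="2023-06-15",
    price=0.15,
    tm=True,
)
print(proj.translator)
```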

montesmariana · 2023-06-06T07:02:19Z

script-with-argparse.py

+    parser.add_argument("--domain", type=str, default = "", help="Overall domain of the text")
+    args = parser.parse_args()
+    proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)
+    proj.days_left()


You will need to explicitly print() your variables for them to be visible when running the script.

I tried to add it, but I'm not sure I've done it correctly. Could you have a look? Many thanks in advance!

Just adding print() like you did in line 229 should work (also for anything else you have to print).
You can test it by running the script! Sorry, it was never going to work with the wrong "__main" above...

montesmariana · 2023-06-06T07:35:17Z

script.py

+            self.words = words
+        if type(start) != str:
+            raise TypeError("The start date must be provided as a string.")
+        elif not re.match ("[0-9]{4}-[0-9]{2}-[0-9]{2}", start):


For validation of your ISO Format dates: the regex is not super robust (you could have 0000-00-00 and it would match, or 0123-40-54...), so I would try the following instead:

    # assumes `import datetime` at the top of the script
    try:
        self.start = datetime.date.fromisoformat(start)
    except ValueError:
        raise ValueError("The start date must be provided in ISO format (YYYY-MM-DD)")

This has the advantage of also already converting your date :)

If you're worried about not using regex anymore, you can always add it as an email validator in your Freelancer class ;)

I changed it, but now when I call the attribute "deadline" it shows the "cumbersome"-looking date object again instead of simply the ISO-format date (the string). So the code is more efficient, but the attribute display is not very user-friendly. I therefore used the "try" with the computed attributes st and dl and kept the string as the value for start and deadline. Could you have a look and say if it's alright that way? Many thanks in advance!

This is similar to what I've mentioned before of having private and public attributes. You could have _start and _deadline (or st and dl) as the private versions used in computing other things, and just start and deadline for printing. Or you could have a method that returns the isoformat version of a date.

In any case, according to the docs, if you have a value of class date, you can turn it back into an ISO-format string with .isoformat() or just print it. That is, even if start is a date, my_project.start.isoformat() returns the ISO string, and print(my_project.start) (instead of just my_project.start) also displays it nicely.
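One possible way to combine both ideas is sketched below: the private attributes hold date objects for computations, while the public ones are read-only properties that return the ISO string. The stripped-down Project class and the `duration_days` method are hypothetical and only serve the illustration.

```python
import datetime


class Project:
    """Stripped-down stand-in for the real class, dates only."""

    def __init__(self, start, deadline):
        # fromisoformat() validates and converts in one step;
        # it raises ValueError for strings such as "0123-40-54".
        self._start = datetime.date.fromisoformat(start)
        self._deadline = datetime.date.fromisoformat(deadline)

    @property
    def start(self):
        """Public, display-friendly version: the ISO-format string."""
        return self._start.isoformat()

    @property
    def deadline(self):
        return self._deadline.isoformat()

    def duration_days(self):
        """Computations use the private date objects (hypothetical helper)."""
        return (self._deadline - self._start).days


my_project = Project("2023-06-01", "2023-06-15")
print(my_project.start)
print(my_project.duration_days())
```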

montesmariana · 2023-06-13T10:25:04Z

Final-assignment_trial-and-error_8_linking classes_v2.py

I know I said that for checking your code it's better if you show me a script instead of a notebook (in terms of commenting the PR), but the difference still stands: definitions in the scripts, actual code running in the notebooks. In particular, if you want me to check how you defined the classes, THAT has to be in the script, but not the test data and examples.
(And here the examples don't work because you are redefining the variables, thus the script itself crashes)

montesmariana · 2023-06-13T10:28:16Z

script-with-argparse.py

+        epilog = "You get an overview of the agency's projects.")
+    parser.add_argument("filename",
+                        type=argparse.FileType("r", encoding="utf-8"),
+                        help="The list of translation projects")
    args = parser.parse_args()
    proj = Project(args.title, args.client, args.source, args.target, args.words, args.start, args.deadline, args.price, args.tm, args.translator, args.revisor, args.status, args.domain)


You are still assuming that the other arguments are provided via the command line. Once you have replaced them with only the filename, you have to work with the file (args.filename); args.title, args.client, etc. don't exist anymore.
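A minimal sketch of working with args.filename follows. It assumes the project list is a CSV file with a header row, which the PR does not specify; `build_parser`, `read_projects`, the description text and the sample file are made up for the example.

```python
import argparse
import csv
import os
import tempfile


def build_parser():
    parser = argparse.ArgumentParser(
        description="Overview of translation projects (placeholder text).",
        epilog="You get an overview of the agency's projects.",
    )
    parser.add_argument(
        "filename",
        type=argparse.FileType("r", encoding="utf-8"),
        help="The list of translation projects",
    )
    return parser


def read_projects(handle):
    """Turn the open file into a list of dicts, one per project (CSV is an assumption)."""
    with handle:
        return list(csv.DictReader(handle))


# Demo: write a made-up project list to a temporary file and parse it,
# passing the path the way it would arrive from the command line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("title,client\nManual,ACME\n")

args = build_parser().parse_args([tmp.name])
projects = read_projects(args.filename)  # the only attribute args has is filename
os.remove(tmp.name)

for row in projects:
    print(row["title"], row["client"])
```

Each row could then be unpacked into the class, for example with Project(**row), provided the column names match the parameter names.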

montesmariana · 2023-06-13T10:30:57Z

script-with-argparse.py

@@ -201,23 +215,16 @@ def __str__(self):
        return "\n".join([sent_1, sent_2, sent_3, sent_4, sent_5, sent_6, sent_7])

 if __name__ == "__main":


Sorry, I made a mistake in the template, it should be if __name__ == "__main__":

Otherwise nothing under it will ever run.
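A minimal sketch of the corrected guard, with a hypothetical main() function standing in for the argparse code:

```python
def main():
    # Stand-in for the real work (parsing arguments, building projects, ...).
    message = "Script is running"
    print(message)  # output only appears if it is printed explicitly
    return message


# __name__ equals "__main__" only when the file is run as a script,
# so this comparison has to match that string exactly.
if __name__ == "__main__":
    main()
```

With "__main" the comparison is always False, because the string is missing its two trailing underscores.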

These are the final files for the last assignment of Introduction to Python.

Sib007 and others added 6 commits May 2, 2023 11:37

update README

5e80120

update tutorial

ba16fa4

Update README.md

793b45a

update README

6da366e

Update README.md

47df909

Add files via upload

12a3cad

For reference, these are examples of translation memory (TM) and termbase (TB) formats compatible with most CAT-tools (computer-assisted translation tools).

montesmariana reviewed May 9, 2023

Sib007 added 6 commits May 9, 2023 10:50

Update README.md

229d6db

Update README.md

d473471

Add files via upload

3612e11

Add files via upload

e2ee48c

Add files via upload

7d8cefc

Presentation 23-05

cda27d5

Backup for presentation on 23-05.

Sib007 and others added 12 commits June 2, 2023 14:53

Cleaning up repository

b75a20c

Removing files that are no longer necessary

Cleaning up repository

dac993f

Removing the files that are no longer necessary.

Cleaning up repository

766d863

Removing the files that are no longer necessary.

Cleaning up repository

ab372aa

Removing the files that are no longer necessary.

Cleaning up repository

94eb734

Removing the files that are no longer necessary.

Cleaning up repository

13698b7

Removing the files that are no longer necessary.

first script final assignment

d27a617

update translator, revisor and status validation

4fd92f9

adding argparse to the script, attempt 1

fdca814

adding the TM generator tool

183b52d

revisor -> reviewer, internal lowercase

69f30d7

Adding a Freelancer class, attempt 1

c715751

montesmariana reviewed Jun 6, 2023


Sib007 and others added 6 commits June 11, 2023 11:13

update argparse

db81406

solving conflicts

bd0cc63

arguments separate lines

18e955f

print variables argparse

7a46a5a

validation start and deadline

0fcf15e

Linking classes, attempt 2

9fbb7dd

montesmariana reviewed Jun 13, 2023


Sib007 added 17 commits June 18, 2023 10:54

Deleting presentation files

a0331e9

Deleting presentation files

aca8780

Deleting presentation files

27ff0d1

Deleting presentation files

fefca57

Deleting presentation files

9744a1d

Deleting presentation files

f6b241f

Deleting draft files

09fa561

Deleting draft files

3d7b0d4

Deleting draft files

9b84b9c

Deleting draft files

f09dee4

Deleting draft files

58f2ff7

Deleting draft files

39285c5

Deleting draft files

4ea8c53

Deleting draft files

c9c4048

Deleting draft files

d062efd

Uploading final files

f0d258b

These are the final files for the last assignment of Introduction to Python.

Deleting superfluous file

dabaa07
