WIP: Final assignment #3

Open
wants to merge 47 commits into
base: sybille
Changes from 6 commits
Commits
47 commits
5e80120
update README
Sib007 May 2, 2023
ba16fa4
update tutorial
Sib007 May 2, 2023
793b45a
Update README.md
Sib007 May 2, 2023
6da366e
update README
Sib007 May 8, 2023
47df909
Update README.md
Sib007 May 8, 2023
12a3cad
Add files via upload
Sib007 May 8, 2023
229d6db
Update README.md
Sib007 May 9, 2023
d473471
Update README.md
Sib007 May 9, 2023
3612e11
Add files via upload
Sib007 May 19, 2023
e2ee48c
Add files via upload
Sib007 May 19, 2023
7d8cefc
Add files via upload
Sib007 May 19, 2023
cda27d5
Presentation 23-05
Sib007 May 22, 2023
b75a20c
Cleaning up repository
Sib007 Jun 2, 2023
dac993f
Cleaning up repository
Sib007 Jun 2, 2023
766d863
Cleaning up repository
Sib007 Jun 2, 2023
ab372aa
Cleaning up repository
Sib007 Jun 2, 2023
94eb734
Cleaning up repository
Sib007 Jun 2, 2023
13698b7
Cleaning up repository
Sib007 Jun 2, 2023
d27a617
first script final assignment
Sib007 Jun 2, 2023
4fd92f9
update translator, revisor and status validation
Sib007 Jun 2, 2023
fdca814
adding argparse to the script, attempt 1
Sib007 Jun 2, 2023
183b52d
adding the TM generator tool
Sib007 Jun 2, 2023
69f30d7
revisor -> reviewer, internal lowercase
Sib007 Jun 3, 2023
c715751
Adding a Freelancer class, attempt 1
Sib007 Jun 3, 2023
db81406
update argparse
Sib007 Jun 11, 2023
bd0cc63
solving conflicts
Sib007 Jun 11, 2023
18e955f
arguments separate lines
Sib007 Jun 11, 2023
7a46a5a
print variables argparse
Sib007 Jun 11, 2023
0fcf15e
validation start and deadline
Sib007 Jun 11, 2023
9fbb7dd
Linking classes, attempt 2
Sib007 Jun 12, 2023
a0331e9
Deleting presentation files
Sib007 Jun 18, 2023
aca8780
Deleting presentation files
Sib007 Jun 18, 2023
27ff0d1
Deleting presentation files
Sib007 Jun 18, 2023
fefca57
Deleting presentation files
Sib007 Jun 18, 2023
9744a1d
Deleting presentation files
Sib007 Jun 18, 2023
f6b241f
Deleting presentation files
Sib007 Jun 18, 2023
09fa561
Deleting draft files
Sib007 Jun 18, 2023
3d7b0d4
Deleting draft files
Sib007 Jun 18, 2023
9b84b9c
Deleting draft files
Sib007 Jun 18, 2023
f09dee4
Deleting draft files
Sib007 Jun 18, 2023
58f2ff7
Deleting draft files
Sib007 Jun 18, 2023
39285c5
Deleting draft files
Sib007 Jun 18, 2023
4ea8c53
Deleting draft files
Sib007 Jun 18, 2023
c9c4048
Deleting draft files
Sib007 Jun 18, 2023
d062efd
Deleting draft files
Sib007 Jun 18, 2023
f0d258b
Uploading final files
Sib007 Jun 18, 2023
dabaa07
Deleting superfluous file
Sib007 Jun 18, 2023
68 changes: 25 additions & 43 deletions README.md
@@ -1,43 +1,25 @@
# A script to make you proud

This repository contains a small Python program that shows that I have learned Python in this semester.

The code has been developed by Mariana Montes.

## Installation and usage

Clone this repository with the following git command in the console (or download the ZIP file via GitHub):

```sh
git clone git@github.com:montesmariana/intro_machine_learning_using_python
```

You can import the script as a module by adding the repository to your path in a Python script or interactive notebook and then calling `import`.

```python
import sys
sys.path.append('/path/to/intro_machine_learning_using_python')
import script as s
```

Check out `tutorial.md` to see an example of the different functionalities of the script!

You can also run the script directly with the following code in a console:

```sh
python script.py <example.json>
```

Or in Jupyter notebook with:

```python
%run script.py <example.json>
```

In both cases `example.json` stands for the `filename` argument that the script needs. You can use [the file in this repository](example.json) or a similar file of yours. Find more information on how this script works with:

```sh
python script.py --help
```

If you run this script, you become proud of yourself.
# What I'm planning
Owner

Sounds very good!


- Start from the translation job management script created for the first assignment, but expand on it to make it useful for a translation agency rather than an independent translator.
- Three class attributes (strings):
  - Translator
  - Revisor
  - Status

-> The default value for "Translator" and "Revisor" is "Internal", meaning that an employee of the translation agency took up the job. If the agency assigned the job to a freelancer, the default value can be changed to their name.
Owner

Yeah, you could also consider having a "Contact" class that has more information about a person and that could be retrieved from some database / json / csv... e.g. something with name, surname, languages, email, phone, availabilities... and this class can be used for all three attributes.
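
A minimal sketch of what such a "Contact" class could look like, assuming the records are stored as a JSON list of dictionaries. The field names, the `load_contacts` helper and the sample record are all hypothetical and only serve as illustration:

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Contact:
    """Hypothetical container for a person involved in a translation job."""
    name: str
    surname: str
    email: str = ""
    phone: str = ""
    languages: List[str] = field(default_factory=list)

    @property
    def full_name(self) -> str:
        return f"{self.name} {self.surname}"


def load_contacts(json_text: str) -> Dict[str, Contact]:
    """Parse a JSON list of contact records into a dict keyed by full name."""
    contacts = [Contact(**record) for record in json.loads(json_text)]
    return {contact.full_name: contact for contact in contacts}


# Illustrative data only; in practice this text would be read from a file.
sample = ('[{"name": "Ada", "surname": "Example", '
          '"email": "ada@example.org", "languages": ["EN", "FR"]}]')
contacts = load_contacts(sample)
translator = contacts["Ada Example"]
print(translator.email, translator.languages)
```

The same `Contact` object could then be assigned to the translator and the revisor attribute of a project, instead of a plain string.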


-> The default value for "Status" is "Created". The status can then be updated as the project progresses to "In translation", "In revision", "Delivered", "Delayed" or "Cancelled". If possible, the script should only accept these six labels to prevent organisational chaos due to everyone using their own labels.
- Instance and computed attributes remain the same as in the first assignment (with some edits to add the advice from the first assignment's feedback).
- Add validation for unexpected input (+ for "Status" labels different from the six authorised labels?)
- Add methods (?) to call the computed attributes and get a result that's more legible than what this currently generates (for example "22 days" instead of "datetime.timedelta(days=22)")
- The input will still be read from a list of dictionaries in a separate JSON file. Those dictionaries will be described in a separate Markdown file.
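
The status validation and the more legible output described in this list could be sketched as follows. This is only an illustration under assumptions: the `Project` class here is a stripped-down stand-in, and the names `ALLOWED_STATUSES` and `days_available` are hypothetical rather than taken from the actual script.

```python
from datetime import date

# The six labels named in the plan above.
ALLOWED_STATUSES = ("Created", "In translation", "In revision",
                    "Delivered", "Delayed", "Cancelled")


class Project:
    """Hypothetical, stripped-down project; the real class has more attributes."""

    def __init__(self, title, start, deadline, status="Created"):
        self.title = title
        self.start = start
        self.deadline = deadline
        self.status = status  # goes through the setter below

    @property
    def status(self):
        return self._status

    @status.setter
    def status(self, value):
        # Reject any label outside the authorised list.
        if value not in ALLOWED_STATUSES:
            raise ValueError(
                f"Unknown status {value!r}; "
                f"expected one of: {', '.join(ALLOWED_STATUSES)}")
        self._status = value

    def days_available(self):
        """Return the time between start and deadline as text, e.g. '22 days'."""
        days = (self.deadline - self.start).days
        return f"{days} day" if days == 1 else f"{days} days"


job = Project("Guide", date(2023, 6, 1), date(2023, 6, 23))
print(job.days_available())  # 22 days
job.status = "In translation"
```

Because the check lives in a property setter, it runs both when the object is created and whenever the status is updated later.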

# What I'd like to add but don't know how

- It would be neat if I could do something with the translation memory and termbase of each project. Maybe add a method that opens them for a preview?
- It would also be super useful to have a way to align a source and a target text and generate a file that can be added to a translation memory. So, to start from two docx-files (or txt-files), split them into sentences and pair each sentence in the source text with the corresponding sentence in the target text and generate a single xml-file with the paired sentences. Ideally, the context (i.e. the surrounding sentences) should also be considered, but if that's not possible an aligned xml would already be awesome.
Owner

This doesn't sound too complicated, at least if you can have both source and target texts as lists of sentences and each sentence has a corresponding one in the other text at the same index. (Think of `zip()`.)

With plain text files (e.g. txt) this will be much easier. With docx files, not so much. I found the following package that opens Word files with Python, you can try it out and see what you get: https://python-docx.readthedocs.io/en/latest/user/documents.html

Then I'd need to see what the aligned xml looks like to get an idea of what the output should be.

Author

Thanks! I just checked and csv is also a good format for TM import and it has a much "lighter" structure, so I'm going to work with csv rather than xml. :)
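
As a rough sketch of the `zip()` idea combined with CSV output: the function name `write_aligned_csv` and the two-column layout with a language header row are assumptions made for illustration, so the layout a given TM tool expects on import should be checked separately.

```python
import csv
import io


def write_aligned_csv(source_sentences, target_sentences, handle,
                      source_lang="English", target_lang="French"):
    """Write one source/target sentence pair per CSV row, after a header row."""
    writer = csv.writer(handle)
    writer.writerow([source_lang, target_lang])
    # zip() pairs the items found at the same index in both lists.
    for source, target in zip(source_sentences, target_sentences):
        writer.writerow([source, target])


source = ["Hello.", "How are you?"]
target = ["Bonjour.", "Comment allez-vous ?"]

# StringIO stands in for open("tm.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_aligned_csv(source, target, buffer)
print(buffer.getvalue())
```

The `csv` module takes care of quoting, so sentences that contain commas or quotation marks still end up in a single cell.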

Owner

Both nltk and spacy offer sentence segmenters. Depending on the languages you need and which library you feel more comfortable with, you could use either one.

Example

Given a running text `text` (as one string), we try to get a list of sentences `sentences`.

nltk

```python
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text, language='english')
```

`'english'` is actually the default value of `language`, but you can provide others. The only documentation I found on which languages are supported was this: https://stackoverflow.com/questions/15111183/what-languages-are-supported-for-nltk-word-tokenize-and-nltk-pos-tag

spacy

```python
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
sentences = list(doc.sents)  # doc.sents is a generator, so convert it to a list
```

Here the language is defined in the name of the model given to `spacy.load()`, and can have other values: https://spacy.io/usage/models#languages

Owner

In any case, you still have to know that given a pair of texts, the first sentence of one corresponds to the first sentence of the other, the second to the second, and so on...
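
One way to make that assumption explicit is to check the sentence counts before pairing, since `zip()` silently stops at the end of the shorter list. A minimal sketch, where the helper name `pair_sentences` is hypothetical and both inputs are assumed to be lists already:

```python
def pair_sentences(source_sentences, target_sentences):
    """Pair sentences by index, refusing to guess when the counts differ."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            f"Cannot align: {len(source_sentences)} source sentences "
            f"but {len(target_sentences)} target sentences")
    return list(zip(source_sentences, target_sentences))


pairs = pair_sentences(["One.", "Two."], ["Un.", "Deux."])
print(pairs)
```

A mismatch in counts does not say where the alignment went wrong, but it does prevent a shifted translation memory from being written without anyone noticing.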


# Things I'm not yet sure how to integrate into the assignment

- Use of argparse.
- Use of regular expressions.
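
A possible way to combine both points, sketched under assumptions: the `--deadline` option, the `YYYY-MM-DD` date format and the helper names below are hypothetical and not taken from the actual script. A regular expression checks the shape of a date string, and `argparse` uses that check as the argument's `type`.

```python
import argparse
import re

# Assumed date format for illustration: YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def iso_date(text):
    """argparse type function: accept only strings shaped like 2023-06-18."""
    if not DATE_PATTERN.fullmatch(text):
        raise argparse.ArgumentTypeError(
            f"{text!r} is not in YYYY-MM-DD format")
    return text


def build_parser():
    parser = argparse.ArgumentParser(description="Manage translation projects.")
    parser.add_argument("filename",
                        help="JSON file with the project dictionaries")
    parser.add_argument("--deadline", type=iso_date,
                        help="deadline as YYYY-MM-DD")
    return parser


# Passing a list to parse_args() lets the sketch run without a real command line.
args = build_parser().parse_args(["projects.json", "--deadline", "2023-06-18"])
print(args.filename, args.deadline)
```

The pattern only checks the shape of the string, so an impossible date such as `2023-13-45` would still pass; parsing with `datetime.date.fromisoformat()` afterwards would catch that.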
24 changes: 24 additions & 0 deletions Translation-technology_TBexample.csv
@@ -0,0 +1,24 @@
Entry_ID,Entry_Subject,Entry_Domain,Entry_ClientID,Entry_ProjectID,Entry_Created,Entry_Creator,English,French
0,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:40:45 PM,Sibylle,machine translation,traduction automatique
1,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:07 PM,Sibylle,advanced topics,sujets avancés
2,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:42 PM,Sibylle,translator,traducteur
3,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:58 PM,Sibylle,localiser,localisateur
4,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:10 PM,Sibylle,revisor,réviseur
5,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:27 PM,Sibylle,website,site web
6,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:43 PM,Sibylle,software,software
7,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:53 PM,Sibylle,game,jeu
8,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:02 PM,Sibylle,subtitler,sous-titreur
9,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:12 PM,Sibylle,post-editor,post-éditeur
10,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:40 PM,Sibylle,technical writer,rédacteur technique
11,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:02 PM,Sibylle,computational linguist,linguiste informatique
12,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:15 PM,Sibylle,translation technology,technologies de la traduction
13,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:34 PM,Sibylle,work placement,stage
14,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:48:18 PM,Sibylle,data,données
15,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:06 PM,Sibylle,statistical,statistique
16,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:31 PM,Sibylle,neural,neuronal
17,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:41 PM,Sibylle,hybrid,hybride
18,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:51 PM,Sibylle,adaptive,adaptatif
19,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:53:07 PM,Sibylle,MT,traduction automatique
20,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:53:31 PM,Sibylle,MT engine,engin de traduction automatique
21,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:54:20 PM,Sibylle,post-editing,post-édition
22,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:55:51 PM,Sibylle,pre-editing,pré-édition
Binary file added Translation-technology_TBexample.xdl
Binary file not shown.