WIP: Final assignment #3
base: sybille
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,25 @@ | ||
# A script to make you proud | ||
|
||
This repository contains a small Python program that shows that I have learned Python this semester. | ||
|
||
The code has been developed by Mariana Montes. | ||
|
||
## Installation and usage | ||
|
||
Clone this repository with the following git command in the console (or download the ZIP file via GitHub): | ||
|
||
```sh | ||
git clone git@github.com:montesmariana/intro_machine_learning_using_python | ||
``` | ||
|
||
You can import the script as a module by adding the repository to your path in a Python script or interactive notebook and then calling `import`. | ||
|
||
```python | ||
import sys | ||
sys.path.append('/path/to/intro_machine_learning_using_python') | ||
import script as s | ||
``` | ||
|
||
Check out `tutorial.md` to see an example of the different functionalities of the script! | ||
|
||
You can also run the script directly with the following code in a console: | ||
|
||
```sh | ||
python script.py <example.json> | ||
``` | ||
|
||
Or in Jupyter notebook with: | ||
|
||
```python | ||
%run script.py <example.json> | ||
``` | ||
|
||
In both cases `example.json` stands for the `filename` argument that the script needs. You can use [the file in this repository](example.json) or a similar file of yours. Find more information on how this script works with: | ||
|
||
```sh | ||
python script.py --help | ||
``` | ||
|
||
If you run this script, you become proud of yourself. | ||
# What I'm planning | ||
|
||
- Start from the translation job management script created for the first assignment, but expand on it to make it useful for a translation agency rather than an independent translator. | ||
- Three class attributes (strings): | ||
  - Translator | ||
  - Revisor | ||
  - Status | ||
|
||
-> The default value for "Translator" and "Revisor" is "Internal", meaning that an employee of the translation agency took up the job. If the agency assigned the job to a freelancer, the default value can be changed to their name. | ||
Yeah, you could also consider having a "Contact" class that has more information about a person and that could be retrieved from some database / json / csv, e.g. something with name, surname, languages, email, phone, availabilities... This class can then be used for all three attributes. |
||
|
||
-> The default value for "Status" is "Created". The status can then be updated as the project progresses to "In translation", "In revision", "Delivered", "Delayed" or "Cancelled". If possible, the script should only accept these six labels to prevent organisational chaos due to everyone using their own labels. | ||
- Instance and computed attributes remain the same as in the first assignment (with some edits to add the advice from the first assignment's feedback). | ||
- Add validation for unexpected input (+ for "Status" labels different from the six authorised labels?) | ||
- Add methods (?) to call the computed attributes and get a result that's more legible than what this currently generates (for example "22 days" instead of "datetime.timedelta(days=22)") | ||
- The input will still be read from a list of dictionaries in a separate json-file. Those dictionaries will be described in a separate markdown file. | ||
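
A minimal sketch of how the class attributes, the status validation and the more legible output planned above could fit together. All names here (`Project`, `ALLOWED_STATUSES`, `update_status`, `length_as_text`) are hypothetical and only illustrate the idea; the actual script from the first assignment may be organised differently.

```python
from datetime import date

# The six authorised labels listed in the plan (the constant name is hypothetical).
ALLOWED_STATUSES = (
    "Created", "In translation", "In revision",
    "Delivered", "Delayed", "Cancelled",
)


class Project:
    """Hypothetical stand-in for the translation job class."""

    # Class attributes with the defaults described in the plan.
    translator = "Internal"
    revisor = "Internal"
    status = "Created"

    def __init__(self, title, start, deadline):
        self.title = title
        self.start = start        # datetime.date
        self.deadline = deadline  # datetime.date

    def update_status(self, new_status):
        """Accept only one of the six authorised labels."""
        if new_status not in ALLOWED_STATUSES:
            raise ValueError(
                f"Unknown status {new_status!r}; expected one of {ALLOWED_STATUSES}"
            )
        # Sets the status on this instance only; the class default stays "Created".
        self.status = new_status

    @property
    def length(self):
        """Computed attribute: a datetime.timedelta."""
        return self.deadline - self.start

    def length_as_text(self):
        """Return the length as readable text instead of a timedelta repr."""
        days = self.length.days
        return f"{days} day" if days == 1 else f"{days} days"
```

For example, `Project("Manual", date(2023, 5, 1), date(2023, 5, 23)).length_as_text()` returns `"22 days"`, and `update_status("Almost done")` raises a `ValueError`.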
|
||
# What I'd like to add but don't know how | ||
|
||
- It would be neat if I could do something with the translation memory and termbase of each project. Maybe add a method that opens them for a preview? | ||
- It would also be super useful to have a way to align a source and a target text and generate a file that can be added to a translation memory. So, to start from two docx-files (or txt-files), split them into sentences and pair each sentence in the source text with the corresponding sentence in the target text and generate a single xml-file with the paired sentences. Ideally, the context (i.e. the surrounding sentences) should also be considered, but if that's not possible an aligned xml would already be awesome. | ||
This doesn't sound too complicated, at least if you can have both source and target texts as lists of sentences and each sentence has a corresponding one in the other text with the same index. (Think of …)

With plain text files (e.g. txt) this will be much easier. With docx files, not so much. I found the following package that opens Word files with Python; you can try it out and see what you get: https://python-docx.readthedocs.io/en/latest/user/documents.html

Then I'd need to see what the aligned xml looks like to get an idea of what the output should be.

Thanks! I just checked and csv is also a good format for TM import and it has a much "lighter" structure, so I'm going to work with csv rather than xml. :)

Both nltk and spacy offer sentence segmenters. Based on the languages you need and what you feel more comfortable with, you could use one or the other.

Example: given a running text `text`.

nltk

    from nltk.tokenize import sent_tokenize
    sentences = sent_tokenize(text, language = 'english')

"english" is actually the default value of `language`.

spacy

    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    sentences = doc.sents

Here the language is defined in the name of the model given to `spacy.load`.

In any case, you still have to know that given a pair of texts, the first sentence of one corresponds to the first sentence of the other, the second to the second, and so on... |
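
A rough sketch of the index-based alignment discussed in this thread, writing the pairs to csv as decided above. It assumes both texts have already been split into sentence lists (for instance with one of the segmenters mentioned) and that sentence *i* of the source corresponds to sentence *i* of the target. The function names and the `English`/`French` column labels are assumptions for illustration, not a format required by any particular TM tool.

```python
import csv
import io


def align_sentences(source_sentences, target_sentences):
    """Pair sentences by position; refuse texts of different lengths."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError(
            f"Cannot align {len(source_sentences)} source sentences "
            f"with {len(target_sentences)} target sentences"
        )
    return list(zip(source_sentences, target_sentences))


def write_alignment(pairs, handle, source_label="English", target_label="French"):
    """Write a header row followed by one row per aligned sentence pair."""
    writer = csv.writer(handle)
    writer.writerow([source_label, target_label])
    writer.writerows(pairs)


# Usage with an in-memory buffer; a real script would open a file with
# open(path, "w", newline="", encoding="utf-8") instead.
pairs = align_sentences(
    ["Hello.", "How are you?"],
    ["Bonjour.", "Comment allez-vous ?"],
)
buffer = io.StringIO()
write_alignment(pairs, buffer)
```

The length check is only a safeguard: it catches texts that were segmented into different numbers of sentences, but it cannot detect pairs that are misaligned while the counts happen to match.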
||
|
||
# Things I'm not yet sure how to integrate into the assignment | ||
|
||
- Use of argparse. | ||
- Use of regular expressions. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
Entry_ID,Entry_Subject,Entry_Domain,Entry_ClientID,Entry_ProjectID,Entry_Created,Entry_Creator,English,French | ||
0,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:40:45 PM,Sibylle,machine translation,traduction automatique | ||
1,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:07 PM,Sibylle,advanced topics,sujets avancés | ||
2,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:42 PM,Sibylle,translator,traducteur | ||
3,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:43:58 PM,Sibylle,localiser,localisateur | ||
4,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:10 PM,Sibylle,revisor,réviseur | ||
5,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:27 PM,Sibylle,website,site web | ||
6,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:43 PM,Sibylle,software,software | ||
7,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:44:53 PM,Sibylle,game,jeu | ||
8,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:02 PM,Sibylle,subtitler,sous-titreur | ||
9,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:12 PM,Sibylle,post-editor,post-éditeur | ||
10,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:45:40 PM,Sibylle,technical writer,rédacteur technique | ||
11,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:02 PM,Sibylle,computational linguist,linguiste informatique | ||
12,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:15 PM,Sibylle,translation technology,technologies de la traduction | ||
13,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:46:34 PM,Sibylle,work placement,stage | ||
14,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:48:18 PM,Sibylle,data,données | ||
15,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:06 PM,Sibylle,statistical,statistique | ||
16,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:31 PM,Sibylle,neural,neuronal | ||
17,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:41 PM,Sibylle,hybrid,hybride | ||
18,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:50:51 PM,Sibylle,adaptive,adaptatif | ||
19,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:53:07 PM,Sibylle,MT,traduction automatique | ||
20,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:53:31 PM,Sibylle,MT engine,engin de traduction automatique | ||
21,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:54:20 PM,Sibylle,post-editing,post-édition | ||
22,Postgraduate programme translation technology,Translation technology,KU Leuven,Intro to Python,5/8/2023 3:55:51 PM,Sibylle,pre-editing,pré-édition |
Sounds very good!
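
For the termbase preview idea mentioned in the plan, a small sketch using the standard `csv` module could look like this. It assumes the termbase is a csv file with the `English` and `French` columns shown in the file above; `preview_termbase` and its `limit` parameter are hypothetical names.

```python
import csv
import io


def preview_termbase(handle, limit=5):
    """Return the first `limit` (English, French) term pairs of a termbase csv."""
    reader = csv.DictReader(handle)
    preview = []
    for index, row in enumerate(reader):
        if index >= limit:
            break
        preview.append((row["English"], row["French"]))
    return preview


# Two rows copied from the termbase above, read from memory for illustration;
# a real script would pass open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "Entry_ID,Entry_Subject,Entry_Domain,Entry_ClientID,Entry_ProjectID,"
    "Entry_Created,Entry_Creator,English,French\n"
    "0,Postgraduate programme translation technology,Translation technology,"
    "KU Leuven,Intro to Python,5/8/2023 3:40:45 PM,Sibylle,"
    "machine translation,traduction automatique\n"
    "7,Postgraduate programme translation technology,Translation technology,"
    "KU Leuven,Intro to Python,5/8/2023 3:44:53 PM,Sibylle,game,jeu\n"
)
terms = preview_termbase(sample)
```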